The NFL is a multi-billion dollar business. There are millions of fans around the world, possibily even yourself, that invest a significant amount of time and resources to follow and watch their favorite teams and players. The teams are always working to fufill their duty to the fans. With the money they receive, they work on providing fans with the best experience possible, whether that's through giving them a great experience at the game with stadium rennovations or through building a winning team that makes fans want to tune in more. Every year, teams get around 225 million dollars to spend on players, with the money being paid through contracts. It's pretty straightforward, teams spend money on players that they think will benefit the team, and all is good, right? Obviously the answer is no, and there is a lot of nuance and subtleties that go into how much to pay players. Time and time again teams either give way too much money to certain players which ends up hurting the team, and many times teams avoid signing a certain player who ends up being a bargain for another. Then there is the question of how do we weigh which positions are worth more money. Each position has vastly different roles on a team, and therefore some are paid better than others. This however, also depends on the scheme the team has as that can affect how much they value each position. Then there are many more factors such as age and past performance that have to be accounted for. Then after all of that, even if done perfectly to the book can still end up not working. A player could not be a hard worker or lazy or had a fluke year, all of which are very hard to account for. So which are the factors you can account for? This is the question we will answer today, by looking through both basic box score predictions as well as ProFootballFocus' grading system, which breaks down the tape for every play of every player. This brings an element of context to the stats, which also can be deceiving based on what happens in a play, for example a quarterback can throw a perfect pass that gets dropped which would be considered an incompletion. We will do analysis on all different types of stats, advanced and basic, to determine which ones correlate with salary the best or how much each one correlates. This will give us a better picture of how we can predict what teams should pay a player based on their past performance and how much value in general they should add to a team. We will break down each category individually, by passing, rushing, receiving, blocking, run defense, pass rush and coverage, which will give us a nice split of positions and will allow us to differentiate the value of certain skills.

In [1014]:
import pandas as pd
import numpy as np
import warnings

# filter out warnings
warnings.filterwarnings('ignore')

# get salary data 
tables = pd.read_html('https://www.spotrac.com/nfl/contracts/sort-value/limit-2000/', flavor='html5lib')

First we must collect data, to do this we used the python library Pandas to read the html. Since the website we are collecting from stores the data we're looking for in a table tag we can easily grab it and store it in a dataframe. This website contains information for players salary including when they signed, total value, average annual value (AAV), and information relating to guaranteed salary.

In [1015]:
import re

salary_df = tables[0].rename(columns={'Player': 'player'})
players = salary_df['player']
pl = []
start_years = []
end_years = []

for player in players:
    try:
        # split up player column and extract start year end year and name
        groups = re.search(r'(.*)  .* \| (\d{4})-(\d{4}) \(FA: (\d{4})\)', player)
        pl.append(groups.groups()[0])
        start_years.append(groups.groups()[1])
        end_years.append(groups.groups()[2])
    # if regex fails dont throw error
    except:
        pl.append(None)
        start_years.append(None)
        end_years.append(None)

# insert clean data
salary_df['player'] = pl
salary_df['start_year'] = start_years
salary_df['end_year'] = end_years

# turn salaries totals into ints
def convert_to_int(v):
    try: return int(v.replace('$', '').replace(',',''))
    except: return 0


salary_df['Value'] = salary_df['Value'].apply(convert_to_int)
salary_df['AAV'] = salary_df['AAV'].apply(convert_to_int)
salary_df['Sign Bonus'] = salary_df['Sign Bonus'].apply(convert_to_int)
salary_df["G'teed @ Sign"] = salary_df["G'teed @ Sign"].apply(convert_to_int)
salary_df["Practical G'teed"] = salary_df["Practical G'teed"].apply(convert_to_int)
salary_df['start_year'] = salary_df['start_year'].apply(convert_to_int)
salary_df['end_year'] = salary_df['end_year'].apply(convert_to_int)
salary_df.fillna(0)
salary_df
Out[1015]:
Rank player Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 1 Patrick Mahomes 24 10 450000000 45000000 10000000 63081905 141481905 2020 2031
1 2 Lamar Jackson 26 5 260000000 52000000 72500000 135000000 185000000 2023 2027
2 3 Josh Allen 25 6 258034000 43005667 16500000 100038596 150000000 2021 2028
3 4 Jalen Hurts 24 5 255000000 51000000 23294000 110000000 179399000 2023 2028
4 5 Russell Wilson 33 5 242588236 48517647 50000000 124000000 161000000 2022 2028
... ... ... ... ... ... ... ... ... ... ... ...
1995 1995 Joshua Miles 27 1 1162500 1162500 152500 352500 352500 2023 2023
1996 1997 Joe Bachie 25 1 1160000 1160000 75000 75000 75000 2023 2023
1997 1998 Jordan Willis 27 1 1156250 1156250 0 0 0 2023 2023
1998 1999 Tim Boyle 28 1 1155000 1155000 0 75000 75000 2023 2023
1999 1999 Marcus Kemp 27 1 1155000 1155000 50000 50000 50000 2023 2023

2000 rows × 11 columns

Now we need to get the player stats. Do do this we extracted csv files from pff.com which contain many useful statistics for each position. Again we are going to store this in a pandas dataframe to be consistent, since we will use all the datasets together. We also need to clean the data. We do this by melting the data and removing any years in which very minimal snaps were played and to only include positions we want. We want to differentiate the years because of how important it is to account for. A player having a good season 5 years ago is not nearly the same as them playing well in the last season.

In [1016]:
# passing players and stats

passing_dfs = []

# passing

# 2022
passing_dfs.append(pd.read_csv('./passing/passing_summary.csv'))
passing_dfs[0]['year'] = 2022

# 2021
passing_dfs.append(pd.read_csv('./passing/passing_summary (1).csv'))
passing_dfs[1]['year'] = 2021

# 2020
passing_dfs.append(pd.read_csv('./passing/passing_summary (2).csv'))
passing_dfs[2]['year'] = 2020

# 2019
passing_dfs.append(pd.read_csv('./passing/passing_summary (3).csv'))
passing_dfs[3]['year'] = 2019

# 2018
passing_dfs.append(pd.read_csv('./passing/passing_summary (4).csv'))
passing_dfs[4]['year'] = 2018


# 2017
passing_dfs.append(pd.read_csv('./passing/passing_summary (5).csv'))
passing_dfs[5]['year'] = 2017

# create dataframe
passing_df = pd.concat(passing_dfs)

# remove outliers
passing_df = passing_df[passing_df['position'] == 'QB']
passing_df = passing_df[passing_df['attempts'] > 100]

# merge players with salary
passing_df = pd.merge(passing_df, salary_df, on='player')
passing_df
Out[1016]:
player player_id position team_name player_game_count accuracy_percent aimed_passes attempts avg_depth_of_target avg_time_to_throw ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Patrick Mahomes 11765 QB KC 20 78.0 685 747 7.5 2.85 ... 1 24 10 450000000 45000000 10000000 63081905 141481905 2020 2031
1 Patrick Mahomes 11765 QB KC 20 79.2 708 780 7.4 2.87 ... 1 24 10 450000000 45000000 10000000 63081905 141481905 2020 2031
2 Patrick Mahomes 11765 QB KC 18 77.1 656 705 8.5 2.91 ... 1 24 10 450000000 45000000 10000000 63081905 141481905 2020 2031
3 Patrick Mahomes 11765 QB KC 17 77.2 549 596 8.5 2.84 ... 1 24 10 450000000 45000000 10000000 63081905 141481905 2020 2031
4 Patrick Mahomes 11765 QB KC 18 77.8 595 652 9.6 2.80 ... 1 24 10 450000000 45000000 10000000 63081905 141481905 2020 2031
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
181 Case Keenum 7323 QB MIN 17 75.1 527 569 7.9 2.72 ... 634 35 2 6250000 3125000 750000 4000000 4000000 2023 2024
182 David Blough 46508 QB DET 5 68.6 159 174 8.9 2.89 ... 1895 27 1 1232500 1232500 76250 76250 76250 2023 2023
183 Blaine Gabbert 6162 QB TEN 6 66.3 98 101 7.7 2.29 ... 1832 33 1 1317500 1317500 152500 1092500 1092500 2023 2023
184 Blaine Gabbert 6162 QB ARZ 5 65.8 161 171 9.8 2.58 ... 1832 33 1 1317500 1317500 152500 1092500 1092500 2023 2023
185 Brian Hoyer 5277 QB NE 8 68.8 202 211 8.4 2.39 ... 836 37 2 4500000 2250000 875000 4205000 4205000 2023 2024

186 rows × 53 columns

Here are the rushing stats, we will limit these to just running backs.

In [1017]:
# rushing players and stats

rushing_dfs = []

# rushing

# 2022
rushing_dfs.append(pd.read_csv('./rushing/rushing_summary.csv'))
rushing_dfs[0]['year'] = 2022

# 2021
rushing_dfs.append(pd.read_csv('./rushing/rushing_summary (1).csv'))
rushing_dfs[1]['year'] = 2021

# 2020
rushing_dfs.append(pd.read_csv('./rushing/rushing_summary (2).csv'))
rushing_dfs[2]['year'] = 2020

# 2019
rushing_dfs.append(pd.read_csv('./rushing/rushing_summary (3).csv'))
rushing_dfs[3]['year'] = 2019

# 2018
rushing_dfs.append(pd.read_csv('./rushing/rushing_summary (4).csv'))
rushing_dfs[4]['year'] = 2018


# 2017
rushing_dfs.append(pd.read_csv('./rushing/rushing_summary (5).csv'))
rushing_dfs[5]['year'] = 2017

# create dataframe
rushing_df = pd.concat(rushing_dfs)

# remove outliers
rushing_df = rushing_df[rushing_df['position'] == 'HB']
rushing_df = rushing_df[rushing_df['attempts'] > 50]

# merge players with salary
rushing_df = pd.merge(rushing_df, salary_df, on='player')
rushing_df
Out[1017]:
player player_id position team_name player_game_count attempts avoided_tackles breakaway_attempts breakaway_percent breakaway_yards ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Derrick Henry 10679 HB TEN 16 349 69 15 27.6 424 ... 142 26 4 50000000 12500000 12000000 25500000 25500000 2020 2023
1 Derrick Henry 10679 HB TEN 9 239 36 9 26.4 264 ... 142 26 4 50000000 12500000 12000000 25500000 25500000 2020 2023
2 Derrick Henry 10679 HB TEN 17 396 77 21 31.2 645 ... 142 26 4 50000000 12500000 12000000 25500000 25500000 2020 2023
3 Derrick Henry 10679 HB TEN 18 386 71 23 33.5 665 ... 142 26 4 50000000 12500000 12000000 25500000 25500000 2020 2023
4 Derrick Henry 10679 HB TEN 16 215 45 9 27.6 292 ... 142 26 4 50000000 12500000 12000000 25500000 25500000 2020 2023
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
208 Ameer Abdullah 9487 HB DET 14 165 20 7 31.9 176 ... 1749 29 1 1750000 1750000 200000 500000 500000 2023 2023
209 Benny Snell Jr. 45954 HB PIT 16 113 14 3 21.0 80 ... 1275 21 4 3222068 805517 702068 702068 702068 2019 2022
210 Benny Snell Jr. 45954 HB PIT 11 108 20 5 23.0 98 ... 1275 21 4 3222068 805517 702068 702068 702068 2019 2022
211 Justice Hill 45865 HB BLT 17 58 8 1 8.0 18 ... 835 25 2 4510000 2255000 1000000 1250000 1250000 2023 2024
212 Ty Montgomery 9527 HB GB 8 71 4 1 13.6 37 ... 1210 29 2 3600000 1800000 300000 300000 300000 2022 2023

213 rows × 58 columns

Here are receiving stats. These are a little more complicated since these will contain primary statistics for tight ends and wide receivers. We will use the data from these files for two seperate dataframes one for the wide receivers and one for the tight ends.

In [1018]:
# receiving players and stats

receiving_dfs = []

# receiving

# 2022
receiving_dfs.append(pd.read_csv('./receiving/receiving_summary.csv'))
receiving_dfs[0]['year'] = 2022

# 2021
receiving_dfs.append(pd.read_csv('./receiving/receiving_summary (1).csv'))
receiving_dfs[1]['year'] = 2021

# 2020
receiving_dfs.append(pd.read_csv('./receiving/receiving_summary (2).csv'))
receiving_dfs[2]['year'] = 2020

# 2019
receiving_dfs.append(pd.read_csv('./receiving/receiving_summary (3).csv'))
receiving_dfs[3]['year'] = 2019

# 2018
receiving_dfs.append(pd.read_csv('./receiving/receiving_summary (4).csv'))
receiving_dfs[4]['year'] = 2018


# 2017
receiving_dfs.append(pd.read_csv('./receiving/receiving_summary (5).csv'))
receiving_dfs[5]['year'] = 2017

# create dataframe
receiving_df = pd.concat(receiving_dfs)

# remove outliers
receiving_df = receiving_df[receiving_df['targets'] > 50]

# merge players with salary
receiving_df = pd.merge(receiving_df, salary_df, on='player')
receiving_df
Out[1018]:
player player_id position team_name player_game_count avg_depth_of_target avoided_tackles caught_percent contested_catch_rate contested_receptions ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Justin Jefferson 61398 WR MIN 18 10.7 11 73.0 55.0 22 ... 422 21 4 13122805 3280701 7103856 13122805 13122805 2020 2024
1 Justin Jefferson 61398 WR MIN 17 13.3 9 66.3 48.1 13 ... 422 21 4 13122805 3280701 7103856 13122805 13122805 2020 2024
2 Justin Jefferson 61398 WR MIN 16 12.0 13 72.7 54.5 12 ... 422 21 4 13122805 3280701 7103856 13122805 13122805 2020 2024
3 Travis Kelce 7844 TE KC 20 7.5 25 75.3 45.5 10 ... 115 30 4 57250000 14312500 0 20750000 22750000 2020 2025
4 Travis Kelce 7844 TE KC 19 7.6 23 72.8 60.0 9 ... 115 30 4 57250000 14312500 0 20750000 22750000 2020 2025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
492 Jordan Akins 47050 TE HST 16 7.4 8 67.9 25.0 2 ... 1074 30 2 3900000 1950000 630000 1730000 1730000 2023 2024
493 O.J. Howard 11774 TE TB 14 10.1 2 65.4 41.7 5 ... 1895 28 1 1232500 1232500 76250 451250 451250 2023 2023
494 Antonio Callaway 48101 WR CLV 16 14.5 3 55.1 38.5 5 ... 1683 25 2 2135000 1067500 0 0 0 2023 2024
495 Marquise Goodwin 7859 WR SF 16 16.0 6 56.6 50.0 11 ... 1756 32 1 1700000 1700000 250000 400000 400000 2023 2023
496 Trent Taylor 11932 WR SF 15 6.9 4 72.9 66.7 8 ... 1895 28 1 1232500 1232500 52500 52500 52500 2023 2023

497 rows × 56 columns

In [1019]:
# get wide receiver data
wr_df = receiving_df[receiving_df['position'] == 'WR']
wr_df
Out[1019]:
player player_id position team_name player_game_count avg_depth_of_target avoided_tackles caught_percent contested_catch_rate contested_receptions ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Justin Jefferson 61398 WR MIN 18 10.7 11 73.0 55.0 22 ... 422 21 4 13122805 3280701 7103856 13122805 13122805 2020 2024
1 Justin Jefferson 61398 WR MIN 17 13.3 9 66.3 48.1 13 ... 422 21 4 13122805 3280701 7103856 13122805 13122805 2020 2024
2 Justin Jefferson 61398 WR MIN 16 12.0 13 72.7 54.5 12 ... 422 21 4 13122805 3280701 7103856 13122805 13122805 2020 2024
9 Tyreek Hill 10799 WR MIA 18 12.6 12 69.6 50.0 13 ... 19 28 4 120000000 30000000 25500000 52535000 72200000 2022 2026
10 Tyreek Hill 10799 WR KC 20 10.6 15 73.6 28.6 6 ... 19 28 4 120000000 30000000 25500000 52535000 72200000 2022 2026
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
486 Randall Cobb 6216 WR GB 15 6.1 5 77.6 38.5 5 ... 1279 32 1 3000000 3000000 250000 250000 250000 2023 2023
488 Phillip Dorsett 9462 WR NE 15 15.5 0 50.0 36.4 4 ... 1954 30 1 1217500 1217500 0 0 0 2023 2023
494 Antonio Callaway 48101 WR CLV 16 14.5 3 55.1 38.5 5 ... 1683 25 2 2135000 1067500 0 0 0 2023 2024
495 Marquise Goodwin 7859 WR SF 16 16.0 6 56.6 50.0 11 ... 1756 32 1 1700000 1700000 250000 400000 400000 2023 2023
496 Trent Taylor 11932 WR SF 15 6.9 4 72.9 66.7 8 ... 1895 28 1 1232500 1232500 52500 52500 52500 2023 2023

331 rows × 56 columns

In [1020]:
# tight end data
te_df = receiving_df[receiving_df['position'] == 'TE']
te_df
Out[1020]:
player player_id position team_name player_game_count avg_depth_of_target avoided_tackles caught_percent contested_catch_rate contested_receptions ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
3 Travis Kelce 7844 TE KC 20 7.5 25 75.3 45.5 10 ... 115 30 4 57250000 14312500 0 20750000 22750000 2020 2025
4 Travis Kelce 7844 TE KC 19 7.6 23 72.8 60.0 9 ... 115 30 4 57250000 14312500 0 20750000 22750000 2020 2025
5 Travis Kelce 7844 TE KC 18 8.9 16 76.4 30.4 7 ... 115 30 4 57250000 14312500 0 20750000 22750000 2020 2025
6 Travis Kelce 7844 TE KC 19 9.1 18 76.3 53.8 14 ... 115 30 4 57250000 14312500 0 20750000 22750000 2020 2025
7 Travis Kelce 7844 TE KC 18 9.3 11 73.4 64.0 16 ... 115 30 4 57250000 14312500 0 20750000 22750000 2020 2025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
489 Kyle Rudolph 6195 TE MIN 17 7.3 5 81.8 80.0 4 ... 1695 32 1 2000000 2000000 500000 2000000 2000000 2022 2022
490 Kyle Rudolph 6195 TE MIN 16 7.3 4 84.2 53.8 7 ... 1695 32 1 2000000 2000000 500000 2000000 2000000 2022 2022
491 Kyle Rudolph 6195 TE MIN 18 8.2 3 72.4 38.9 7 ... 1695 32 1 2000000 2000000 500000 2000000 2000000 2022 2022
492 Jordan Akins 47050 TE HST 16 7.4 8 67.9 25.0 2 ... 1074 30 2 3900000 1950000 630000 1730000 1730000 2023 2024
493 O.J. Howard 11774 TE TB 14 10.1 2 65.4 41.7 5 ... 1895 28 1 1232500 1232500 76250 451250 451250 2023 2023

100 rows × 56 columns

Here are the blocking stats. These contain primary stats for offensive lineman which are typically broken down into 3 positions. Center, guard and tackle. We will create three seperate dataframes for this reason.

In [1021]:
# offense_blocking players and stats

offense_blocking_dfs = []

# offense_blocking

# 2022
offense_blocking_dfs.append(pd.read_csv('./offense_blocking/offense_blocking.csv'))
offense_blocking_dfs[0]['year'] = 2022

# 2021
offense_blocking_dfs.append(pd.read_csv('./offense_blocking/offense_blocking (1).csv'))
offense_blocking_dfs[1]['year'] = 2021

# 2020
offense_blocking_dfs.append(pd.read_csv('./offense_blocking/offense_blocking (2).csv'))
offense_blocking_dfs[2]['year'] = 2020

# 2019
offense_blocking_dfs.append(pd.read_csv('./offense_blocking/offense_blocking (3).csv'))
offense_blocking_dfs[3]['year'] = 2019

# 2018
offense_blocking_dfs.append(pd.read_csv('./offense_blocking/offense_blocking (4).csv'))
offense_blocking_dfs[4]['year'] = 2018


# 2017
offense_blocking_dfs.append(pd.read_csv('./offense_blocking/offense_blocking (5).csv'))
offense_blocking_dfs[5]['year'] = 2017

# create dataframe
offense_blocking_df = pd.concat(offense_blocking_dfs)

# remove outliers
offense_blocking_df = offense_blocking_df[offense_blocking_df['snap_counts_block'] > 100]

# merge players with salary
offense_blocking_df = pd.merge(offense_blocking_df, salary_df, on='player')
offense_blocking_df
Out[1021]:
player player_id position team_name player_game_count block_percent declined_penalties franchise_id grades_offense grades_pass_block ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Jason Kelce 6343 C PHI 20 100.0 1 24 89.4 83.3 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
1 Jason Kelce 6343 C PHI 18 100.0 0 24 84.5 69.2 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
2 Jason Kelce 6343 C PHI 16 100.0 0 24 69.6 62.0 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
3 Jason Kelce 6343 C PHI 17 100.0 1 24 81.1 67.6 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
4 Jason Kelce 6343 C PHI 18 100.0 0 24 87.1 88.0 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1750 Brian Hoyer 5277 QB NE 12 34.5 0 19 63.6 NaN ... 836 37 2 4500000 2250000 875000 4205000 4205000 2023 2024
1751 Cordarrelle Patterson 7810 WR OAK 16 29.6 0 23 65.5 62.4 ... 486 31 2 10500000 5250000 5000000 5000000 5000000 2022 2023
1752 Blaine Gabbert 6162 QB ARZ 5 33.3 0 1 45.4 NaN ... 1832 33 1 1317500 1317500 152500 1092500 1092500 2023 2023
1753 Trent Taylor 11932 WR SF 15 22.5 0 28 68.4 69.6 ... 1895 28 1 1232500 1232500 52500 52500 52500 2023 2023
1754 Derrick Henry 10679 HB TEN 18 20.5 0 31 73.0 76.5 ... 142 26 4 50000000 12500000 12000000 25500000 25500000 2020 2023

1755 rows × 42 columns

In [1022]:
# center data
c_df = offense_blocking_df[offense_blocking_df['position'] == 'C']
c_df
Out[1022]:
player player_id position team_name player_game_count block_percent declined_penalties franchise_id grades_offense grades_pass_block ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Jason Kelce 6343 C PHI 20 100.0 1 24 89.4 83.3 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
1 Jason Kelce 6343 C PHI 18 100.0 0 24 84.5 69.2 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
2 Jason Kelce 6343 C PHI 16 100.0 0 24 69.6 62.0 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
3 Jason Kelce 6343 C PHI 17 100.0 1 24 81.1 67.6 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
4 Jason Kelce 6343 C PHI 18 100.0 0 24 87.1 88.0 ... 389 35 1 14250000 14250000 10335000 14250000 14250000 2023 2023
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1564 Michael Deiter 38554 C MIA 8 100.0 1 17 60.6 57.2 ... 1895 26 1 1232500 1232500 125000 125000 125000 2023 2023
1570 Kyle Fuller 11998 C SEA 10 100.0 0 29 46.5 31.3 ... 1644 30 1 2500000 2500000 1380000 2500000 2500000 2022 2022
1618 Trey Hill 81653 C CIN 7 100.0 1 7 53.3 65.3 ... 1180 21 4 3672352 918088 192352 192352 192352 2021 2024
1656 Trystan Colon-Castillo 41117 C BLT 12 98.0 0 3 62.6 74.5 ... 1786 24 1 1660000 1660000 250000 250000 250000 2023 2023
1657 Trystan Colon-Castillo 41117 C BLT 2 100.0 0 3 67.6 70.0 ... 1786 24 1 1660000 1660000 250000 250000 250000 2023 2023

127 rows × 42 columns

In [1023]:
# guard data
g_df = offense_blocking_df[offense_blocking_df['position'] == 'G']
g_df
Out[1023]:
player player_id position team_name player_game_count block_percent declined_penalties franchise_id grades_offense grades_pass_block ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
6 Isaac Seumalo 10713 G PHI 20 100.0 3 24 72.7 78.2 ... 257 29 3 24000000 8000000 6950000 6950000 6950000 2023 2025
7 Isaac Seumalo 10713 G PHI 3 100.0 0 24 74.8 82.9 ... 257 29 3 24000000 8000000 6950000 6950000 6950000 2023 2025
8 Isaac Seumalo 10713 G PHI 9 100.0 0 24 62.4 77.4 ... 257 29 3 24000000 8000000 6950000 6950000 6950000 2023 2025
9 Isaac Seumalo 10713 G PHI 17 100.0 0 24 70.6 70.1 ... 257 29 3 24000000 8000000 6950000 6950000 6950000 2023 2025
10 Isaac Seumalo 10713 G PHI 15 99.8 1 24 67.3 64.4 ... 257 29 3 24000000 8000000 6950000 6950000 6950000 2023 2025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1686 Shane Lemieux 28149 G NYG 11 100.0 2 21 32.2 16.9 ... 1194 23 4 3642140 910535 347140 347140 347140 2020 2023
1689 Colton McKivitz 29314 G SF 9 100.0 0 28 53.1 49.9 ... 822 26 2 4610000 2305000 730000 1870000 1870000 2023 2024
1690 Chris Hubbard 8051 G CLV 8 100.0 0 8 71.9 78.3 ... 1346 30 1 2622500 2622500 152500 2622500 2622500 2022 2022
1699 Aaron Stinnie 28653 G TB 6 100.0 0 30 57.6 41.9 ... 1821 28 1 1400000 1400000 0 150000 150000 2023 2023
1721 Nick Harris 46752 G CLV 2 100.0 0 8 51.3 34.0 ... 1207 21 4 3609648 902412 314648 314648 314648 2020 2023

336 rows × 42 columns

In [1024]:
# tackle data
t_df = offense_blocking_df[offense_blocking_df['position'] == 'T']
t_df
Out[1024]:
player player_id position team_name player_game_count block_percent declined_penalties franchise_id grades_offense grades_pass_block ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
14 Orlando Brown Jr. 46227 T KC 20 100.0 1 16 75.4 76.8 ... 93 26 4 64092000 16023000 31100000 31100000 31100000 2023 2026
15 Orlando Brown Jr. 46227 T KC 19 100.0 3 16 75.4 74.4 ... 93 26 4 64092000 16023000 31100000 31100000 31100000 2023 2026
16 Orlando Brown Jr. 46227 T BLT 18 100.0 0 3 77.8 77.8 ... 93 26 4 64092000 16023000 31100000 31100000 31100000 2023 2026
17 Orlando Brown Jr. 46227 T BLT 17 100.0 2 3 73.6 83.3 ... 93 26 4 64092000 16023000 31100000 31100000 31100000 2023 2026
18 Orlando Brown Jr. 46227 T BLT 16 100.0 0 3 68.6 75.1 ... 93 26 4 64092000 16023000 31100000 31100000 31100000 2023 2026
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1705 Kendall Lamm 10017 T CLV 10 98.4 1 8 66.7 69.5 ... 1832 30 1 1317500 1317500 100000 200000 200000 2023 2023
1706 Kendall Lamm 10017 T HST 15 100.0 0 13 65.2 76.0 ... 1832 30 1 1317500 1317500 100000 200000 200000 2023 2023
1707 Kendall Lamm 10017 T HST 6 100.0 0 13 58.4 50.2 ... 1832 30 1 1317500 1317500 100000 200000 200000 2023 2023
1742 Ja'Wuan James 8654 T MIA 15 99.9 4 17 72.4 67.4 ... 831 29 2 4575000 2287500 500000 500000 500000 2021 2022
1743 Ja'Wuan James 8654 T MIA 8 100.0 0 17 69.0 77.8 ... 831 29 2 4575000 2287500 500000 500000 500000 2021 2022

373 rows × 42 columns

On the defensive side we have to do different things with our dataframes. Instead of one category containing information for multiple positions in this case our positions contain information we want from multiple categories. For the interior defensive line and edge rushers, we want stats from our run defense and pass rushing dataframes. For linebackers, we want coverage and run defense statistics. To do this we will merge on positions and player names so that we can split our data up accordingly.

In [1025]:
# run_defense players and stats

run_defense_dfs = []

# run_defense

# 2022
run_defense_dfs.append(pd.read_csv('./run_defense/run_defense_summary.csv'))
run_defense_dfs[0]['year'] = 2022

# 2021
run_defense_dfs.append(pd.read_csv('./run_defense/run_defense_summary (1).csv'))
run_defense_dfs[1]['year'] = 2021

# 2020
run_defense_dfs.append(pd.read_csv('./run_defense/run_defense_summary (2).csv'))
run_defense_dfs[2]['year'] = 2020

# 2019
run_defense_dfs.append(pd.read_csv('./run_defense/run_defense_summary (3).csv'))
run_defense_dfs[3]['year'] = 2019

# 2018
run_defense_dfs.append(pd.read_csv('./run_defense/run_defense_summary (4).csv'))
run_defense_dfs[4]['year'] = 2018


# 2017
run_defense_dfs.append(pd.read_csv('./run_defense/run_defense_summary (5).csv'))
run_defense_dfs[5]['year'] = 2017

# create dataframe
run_defense_df = pd.concat(run_defense_dfs)

# remove outliers
run_defense_df = run_defense_df[run_defense_df['snap_counts_run'] > 100]

# merge players with salary
run_defense_df = pd.merge(run_defense_df, salary_df, on='player')
run_defense_df
Out[1025]:
player player_id position team_name player_game_count assists avg_depth_of_tackle declined_penalties forced_fumbles franchise_id ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Calais Campbell 4364 DI BLT 15 11 3.5 1 0 3 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
1 Calais Campbell 4364 DI BLT 15 12 2.1 1 0 3 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
2 Calais Campbell 4364 DI BLT 14 5 1.5 0 0 3 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
3 Calais Campbell 4364 DI JAX 16 10 1.5 0 2 15 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
4 Calais Campbell 4364 ED JAX 16 10 0.8 0 1 15 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1848 Jason Cabinda 50714 LB OAK 10 3 5.6 0 0 23 ... 953 25 2 4100000 2050000 1000000 2035000 2035000 2022 2023
1849 Leonard Johnson 7385 CB BUF 16 1 3.8 1 0 4 ... 1335 24 3 2695000 898333 0 0 0 2023 2025
1850 Justin Bethel 7654 CB ARZ 9 4 5.2 0 0 1 ... 1832 32 1 1317500 1317500 75000 75000 75000 2023 2023
1851 Cody Davis 8458 S LA 7 4 8.6 0 0 26 ... 1682 33 1 2200000 2200000 0 500000 500000 2023 2023
1852 Miles Killebrew 10745 S DET 14 7 5.8 0 0 11 ... 986 28 2 4000000 2000000 965000 965000 965000 2022 2023

1853 rows × 35 columns

In [1026]:
# pass_rush players and stats

pass_rush_dfs = []

# pass_rush

# 2022
pass_rush_dfs.append(pd.read_csv('./pass_rush/pass_rush_summary.csv'))
pass_rush_dfs[0]['year'] = 2022

# 2021
pass_rush_dfs.append(pd.read_csv('./pass_rush/pass_rush_summary (1).csv'))
pass_rush_dfs[1]['year'] = 2021

# 2020
pass_rush_dfs.append(pd.read_csv('./pass_rush/pass_rush_summary (2).csv'))
pass_rush_dfs[2]['year'] = 2020

# 2019
pass_rush_dfs.append(pd.read_csv('./pass_rush/pass_rush_summary (3).csv'))
pass_rush_dfs[3]['year'] = 2019

# 2018
pass_rush_dfs.append(pd.read_csv('./pass_rush/pass_rush_summary (4).csv'))
pass_rush_dfs[4]['year'] = 2018


# 2017
pass_rush_dfs.append(pd.read_csv('./pass_rush/pass_rush_summary (5).csv'))
pass_rush_dfs[5]['year'] = 2017

# create dataframe
pass_rush_df = pd.concat(pass_rush_dfs)

# remove outliers
pass_rush_df = pass_rush_df[pass_rush_df['snap_counts_pass_rush'] > 100]

# merge players with salary
pass_rush_df = pd.merge(pass_rush_df, salary_df, on='player')
pass_rush_df
Out[1026]:
player player_id position team_name player_game_count batted_passes declined_penalties franchise_id grades_pass_rush_defense hits ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Calais Campbell 4364 DI BLT 15 2 1 3 71.7 12 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
1 Calais Campbell 4364 DI BLT 15 1 1 3 69.5 11 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
2 Calais Campbell 4364 DI BLT 14 6 0 3 73.0 4 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
3 Calais Campbell 4364 DI JAX 16 1 0 15 81.4 19 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
4 Calais Campbell 4364 ED JAX 16 2 0 15 70.4 11 ... 601 36 1 7000000 7000000 3000000 7000000 7000000 2023 2023
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
956 C.J. Mosley 8652 LB BLT 16 1 0 3 57.3 1 ... 47 26 5 85000000 17000000 7500000 43000000 51000000 2019 2024
957 Jarrad Davis 11776 LB DET 16 0 0 11 74.5 3 ... 1980 28 1 1180000 1180000 0 0 0 2023 2023
958 Jayon Brown 11910 LB TEN 16 0 0 31 86.6 4 ... 1890 27 1 1250000 1250000 125000 625000 625000 2022 2022
959 Deadrin Senat 48815 DI ATL 15 0 0 2 68.7 2 ... 1895 28 1 1232500 1232500 152500 652500 652500 2023 2023
960 Christian Kirksey 8706 LB CLV 16 0 0 8 56.7 2 ... 492 29 2 10000000 5000000 2000000 4000000 4000000 2022 2023

961 rows × 45 columns

In [1027]:
# defense_coverage players and stats

defense_coverage_dfs = []

# defense_coverage

# 2022
defense_coverage_dfs.append(pd.read_csv('./defense_coverage/defense_coverage_summary.csv'))
defense_coverage_dfs[0]['year'] = 2022

# 2021
defense_coverage_dfs.append(pd.read_csv('./defense_coverage/defense_coverage_summary (1).csv'))
defense_coverage_dfs[1]['year'] = 2021

# 2020
defense_coverage_dfs.append(pd.read_csv('./defense_coverage/defense_coverage_summary (2).csv'))
defense_coverage_dfs[2]['year'] = 2020

# 2019
defense_coverage_dfs.append(pd.read_csv('./defense_coverage/defense_coverage_summary (3).csv'))
defense_coverage_dfs[3]['year'] = 2019

# 2018
defense_coverage_dfs.append(pd.read_csv('./defense_coverage/defense_coverage_summary (4).csv'))
defense_coverage_dfs[4]['year'] = 2018


# 2017
defense_coverage_dfs.append(pd.read_csv('./defense_coverage/defense_coverage_summary (5).csv'))
defense_coverage_dfs[5]['year'] = 2017

# create dataframe
defense_coverage_df = pd.concat(defense_coverage_dfs)

# remove outliers
defense_coverage_df = defense_coverage_df[defense_coverage_df['snap_counts_coverage'] > 100]

# merge players with salary
defense_coverage_df = pd.merge(defense_coverage_df, salary_df, on='player')
defense_coverage_df
Out[1027]:
player player_id position team_name player_game_count assists avg_depth_of_target catch_rate coverage_percent coverage_snaps_per_reception ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Patrick Peterson 6157 CB MIN 18 2 12.1 60.0 99.3 14.6 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
1 Patrick Peterson 6157 CB MIN 13 3 10.0 57.8 100.0 13.9 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
2 Patrick Peterson 6157 CB ARZ 16 2 12.1 66.7 99.4 12.8 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
3 Patrick Peterson 6157 CB ARZ 10 3 10.4 67.2 99.3 10.8 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
4 Patrick Peterson 6157 CB ARZ 16 2 11.7 57.7 99.5 19.2 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1124 Tony Brown 49255 CB GB 10 1 10.9 63.2 100.0 7.0 ... 1895 26 1 1232500 1232500 152500 500000 500000 2023 2023
1125 Leonard Johnson 7385 CB BUF 16 2 6.3 71.1 96.6 7.2 ... 1335 24 3 2695000 898333 0 0 0 2023 2025
1126 Cody Davis 8458 S LA 7 0 12.4 22.2 98.3 82.0 ... 1682 33 1 2200000 2200000 0 500000 500000 2023 2023
1127 Bud Dupree 9455 ED PIT 16 0 6.9 69.2 27.1 14.4 ... 1279 30 1 3000000 3000000 1250000 2490000 2490000 2023 2023
1128 Miles Killebrew 10745 S DET 15 2 11.8 64.0 94.0 13.3 ... 986 28 2 4000000 2000000 965000 965000 965000 2022 2023

1129 rows × 51 columns

In [1028]:
idl_df = pd.merge(run_defense_df[run_defense_df['position'] == 'DI'], pass_rush_df[pass_rush_df['position'] == 'DI'])
idl_df
Out[1028]:
player player_id position team_name player_game_count assists avg_depth_of_tackle declined_penalties forced_fumbles franchise_id ... true_pass_set_hurries true_pass_set_pass_rush_opp true_pass_set_pass_rush_percent true_pass_set_pass_rush_win_rate true_pass_set_pass_rush_wins true_pass_set_prp true_pass_set_sacks true_pass_set_snap_counts_pass_play true_pass_set_snap_counts_pass_rush true_pass_set_total_pressures
0 Calais Campbell 4364 DI BLT 15 11 3.5 1 0 3 ... 13 183 98.5 21.3 39 9.0 6 196 193 27
1 Calais Campbell 4364 DI BLT 15 12 2.1 1 0 3 ... 15 183 100.0 15.3 28 5.7 0 194 194 21
2 Calais Campbell 4364 DI BLT 14 5 1.5 0 0 3 ... 12 131 98.0 13.7 18 6.5 2 151 148 15
3 Calais Campbell 4364 DI JAX 16 10 1.5 0 2 15 ... 28 218 100.0 22.9 50 11.2 5 238 238 44
4 Tyson Alualu 5535 DI PIT 17 2 2.6 0 0 25 ... 6 72 100.0 9.7 7 5.6 0 75 75 8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
429 L.J. Collier 48822 DI SEA 17 5 1.4 0 0 29 ... 6 97 99.0 14.4 14 6.7 3 101 100 10
430 Daniel Ekuale 48874 DI JAX 9 7 3.0 0 0 15 ... 0 29 100.0 3.4 1 1.7 0 32 32 1
431 Sheldon Day 10737 DI SF 19 3 1.9 0 0 28 ... 2 48 100.0 6.3 3 4.2 1 50 50 3
432 Sheldon Day 10737 DI SF 12 2 3.5 0 0 28 ... 8 60 100.0 18.3 11 10.0 2 61 61 10
433 Sheldon Day 10737 DI SF 12 4 1.9 0 0 28 ... 1 59 100.0 8.5 5 4.2 2 68 68 3

434 rows × 60 columns

In [1029]:
edge_df = pd.merge(run_defense_df[run_defense_df['position'] == 'ED'], pass_rush_df[pass_rush_df['position'] == 'ED'])
edge_df
Out[1029]:
player player_id position team_name player_game_count assists avg_depth_of_tackle declined_penalties forced_fumbles franchise_id ... true_pass_set_hurries true_pass_set_pass_rush_opp true_pass_set_pass_rush_percent true_pass_set_pass_rush_win_rate true_pass_set_pass_rush_wins true_pass_set_prp true_pass_set_sacks true_pass_set_snap_counts_pass_play true_pass_set_snap_counts_pass_rush true_pass_set_total_pressures
0 Calais Campbell 4364 ED JAX 16 10 0.8 0 1 15 ... 18 204 100.0 17.6 36 10.8 11 223 223 33
1 Calais Campbell 4364 ED JAX 19 13 1.9 0 1 15 ... 35 321 100.0 24.9 80 12.3 15 344 344 64
2 Brandon Graham 5538 ED PHI 20 2 1.1 0 0 24 ... 21 144 96.8 31.3 45 16.0 12 158 153 34
3 Brandon Graham 5538 ED PHI 16 6 1.7 0 0 24 ... 20 140 96.8 27.1 38 12.5 5 156 151 30
4 Brandon Graham 5538 ED PHI 17 6 0.5 0 1 24 ... 26 235 97.3 20.4 48 10.4 8 258 251 41
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
350 Isaac Rochell 11980 ED LAC 18 2 2.4 0 0 27 ... 16 193 100.0 14.0 27 7.5 6 210 210 23
351 Takkarist McKinley 11781 ED ATL 14 6 2.9 0 0 2 ... 12 99 91.7 20.2 20 10.6 2 120 110 19
352 Takkarist McKinley 11781 ED ATL 15 3 3.2 0 0 2 ... 24 193 99.1 20.7 40 11.1 6 217 215 37
353 Takkarist McKinley 11781 ED ATL 18 4 0.4 1 0 2 ... 23 148 99.4 23.0 34 13.5 7 159 158 33
354 Jordan Willis 11828 ED CIN 16 3 1.8 0 0 7 ... 10 138 100.0 13.8 19 5.8 2 148 148 14

355 rows × 60 columns

In [1030]:
lb_df = pd.merge(run_defense_df[run_defense_df['position'] == 'LB'], defense_coverage_df[defense_coverage_df['position'] == 'LB'])
lb_df
Out[1030]:
player player_id position team_name player_game_count assists avg_depth_of_tackle declined_penalties forced_fumbles franchise_id ... qb_rating_against receptions snap_counts_coverage snap_counts_pass_play targets touchdowns yards yards_after_catch yards_per_coverage_snap yards_per_reception

0 rows × 56 columns

For cornerbacks and safeties, we only need to worry about coverage, so we will split it the same way we did for offensive players.

In [1031]:
# cornerbacks
cb_df = defense_coverage_df[defense_coverage_df['position'] == 'CB']
cb_df
Out[1031]:
player player_id position team_name player_game_count assists avg_depth_of_target catch_rate coverage_percent coverage_snaps_per_reception ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
0 Patrick Peterson 6157 CB MIN 18 2 12.1 60.0 99.3 14.6 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
1 Patrick Peterson 6157 CB MIN 13 3 10.0 57.8 100.0 13.9 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
2 Patrick Peterson 6157 CB ARZ 16 2 12.1 66.7 99.4 12.8 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
3 Patrick Peterson 6157 CB ARZ 10 3 10.4 67.2 99.3 10.8 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
4 Patrick Peterson 6157 CB ARZ 16 2 11.7 57.7 99.5 19.2 ... 398 32 2 14000000 7000000 5850000 5850000 5850000 2023 2024
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1106 Michael Ojemudia 50327 CB DEN 14 2 13.7 68.5 99.5 10.5 ... 812 22 4 4751392 1187848 1015560 1015560 1015560 2020 2023
1109 Lamar Jackson 50532 CB NYJ 8 2 12.1 66.7 100.0 9.8 ... 2 26 5 260000000 52000000 72500000 135000000 185000000 2023 2027
1116 Siran Neal 27130 CB BUF 7 1 5.6 72.2 91.8 9.0 ... 519 27 3 9000000 3000000 1600000 2900000 3300000 2022 2024
1124 Tony Brown 49255 CB GB 10 1 10.9 63.2 100.0 7.0 ... 1895 26 1 1232500 1232500 152500 500000 500000 2023 2023
1125 Leonard Johnson 7385 CB BUF 16 2 6.3 71.1 96.6 7.2 ... 1335 24 3 2695000 898333 0 0 0 2023 2025

441 rows × 51 columns

In [1032]:
# Safeties
s_df = defense_coverage_df[defense_coverage_df['position'] == 'S']
s_df
Out[1032]:
player player_id position team_name player_game_count assists avg_depth_of_target catch_rate coverage_percent coverage_snaps_per_reception ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
30 Harrison Smith 7641 S MIN 15 5 10.0 80.0 97.8 16.4 ... 96 32 4 64000000 16000000 9579410 14179410 26379410 2021 2025
31 Harrison Smith 7641 S MIN 15 9 10.1 74.1 93.5 28.9 ... 96 32 4 64000000 16000000 9579410 14179410 26379410 2021 2025
32 Harrison Smith 7641 S MIN 16 10 15.6 65.4 93.9 31.3 ... 96 32 4 64000000 16000000 9579410 14179410 26379410 2021 2025
33 Harrison Smith 7641 S MIN 17 10 9.3 66.7 94.4 25.3 ... 96 32 4 64000000 16000000 9579410 14179410 26379410 2021 2025
34 Harrison Smith 7641 S MIN 16 1 12.4 66.7 92.7 22.2 ... 96 32 4 64000000 16000000 9579410 14179410 26379410 2021 2025
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1111 Michael Thomas 7279 S NYG 16 4 6.8 82.6 90.6 9.7 ... 1983 32 1 1165000 1165000 0 0 0 2023 2023
1112 Michael Thomas 7279 S NYG 16 3 6.8 63.6 92.6 14.1 ... 492 30 1 10000000 10000000 5000000 10000000 10000000 2023 2024
1113 Michael Thomas 7279 S NYG 16 3 6.8 63.6 92.6 14.1 ... 1983 32 1 1165000 1165000 0 0 0 2023 2023
1126 Cody Davis 8458 S LA 7 0 12.4 22.2 98.3 82.0 ... 1682 33 1 2200000 2200000 0 500000 500000 2023 2023
1128 Miles Killebrew 10745 S DET 15 2 11.8 64.0 94.0 13.3 ... 986 28 2 4000000 2000000 965000 965000 965000 2022 2023

344 rows × 51 columns

Now we need to explore our data and understand it. This is where we analyze the features of our dataset and start to determine which ones are more valueable for our hypothesis. To do this we will use single vector decomposition. We will import svds from the scipy linear algebra library and apply it to our data. We need to first remove any irrelevant data from ourdatasets such as player_id, franchise_id and team and handle any missing data. For missing data we replaced it with the mean from that column. Since pretty much all missing data has already been dropped this will not affect the analysis here much.

In [1033]:
import numpy as np
from scipy.sparse.linalg import svds

# passing_df = passing_df.fillna(passing_df.mean())
numeric_cols = passing_df.select_dtypes(include=[np.number]).columns.tolist()

passing_df[numeric_cols] = passing_df[numeric_cols].replace(-1, passing_df[numeric_cols].mean()).fillna(passing_df[numeric_cols].mean())

# Extract the matrix of numerical features you want to perform SVD on
features = passing_df.select_dtypes(include=[np.number])

features = features.drop([ 'year',
       'Signed Age', 'Yrs', 'Value', 'AAV', 'Sign Bonus', "G'teed @ Sign",
       "Practical G'teed", 'player_id', 'franchise_id', 'Rank', 'start_year', 'end_year', 'attempts', 'passing_snaps'], axis=1)

cols = len(features.columns) -1
features
# Perform SVD on the matrix
U, S, Vt = svds(features.to_numpy(), k=cols)

# Check the shape of the SVD factors
Vt[cols-1]
features.shape
covariance = np.dot(features.to_numpy().T, features.to_numpy())
covariance.shape
U, S, Vt = svds(covariance, k=cols)
print('Shape of U:', U.shape)
print('Shape of S:', S.shape)
print('Shape of Vt:', Vt.shape)
Shape of U: (35, 34)
Shape of S: (34,)
Shape of Vt: (34, 35)

These values indicate how much correlation there are between all the data. Each element in the vector represents its respective column. If the value is closer to 0 that means there is high correlation between that feature and the dataset. Now we will find and list out the features themselves in order from most correlated to least to give us a good picture of the value for our features.

In [1034]:
print(S)
print(Vt[cols-2])
print(features.columns)
[5.32132048e+00 1.95195817e+01 2.21593767e+01 2.69058572e+01
 3.08601982e+01 3.25226529e+01 4.34286074e+01 1.25780933e+02
 1.65291827e+02 2.07045581e+02 3.03147850e+02 3.95140046e+02
 4.27939212e+02 5.63524917e+02 7.81887424e+02 1.16018319e+03
 1.28718542e+03 1.84625105e+03 2.97174001e+03 3.64043634e+03
 4.15132218e+03 6.67026650e+03 7.24450580e+03 9.67151064e+03
 1.20079857e+04 1.65821938e+04 2.13225843e+04 2.93136814e+04
 4.74373874e+04 5.25951189e+04 1.52070323e+05 5.24450674e+05
 1.16167356e+06 2.21986736e+09]
[-2.34379971e-02 -3.21159966e-01 -3.07499708e-01 -3.80265072e-02
 -1.31386273e-02 -1.01691087e-02  2.17564836e-02 -1.07714363e-02
 -2.71265983e-01 -1.30782715e-01 -1.84448929e-04 -2.35771655e-01
 -3.68752391e-02 -4.75904584e-01 -2.26877224e-02 -1.49394939e-02
 -2.54390389e-01 -2.41356439e-01 -2.30091387e-01 -2.94629687e-01
 -9.20545531e-03 -1.64021726e-02 -6.77729132e-03 -1.03144641e-01
 -3.27472015e-01 -3.78441921e-02 -7.04074693e-02 -5.86298697e-02
 -4.35628176e-03 -1.32807256e-02  2.70608404e-02 -2.86607717e-02
 -1.84800500e-02  1.75919108e-01 -2.68586227e-02]
Index(['player_game_count', 'accuracy_percent', 'aimed_passes',
       'avg_depth_of_target', 'avg_time_to_throw', 'bats', 'big_time_throws',
       'btt_rate', 'completion_percent', 'completions', 'declined_penalties',
       'def_gen_pressures', 'drop_rate', 'dropbacks', 'drops', 'first_downs',
       'grades_hands_fumble', 'grades_offense', 'grades_pass', 'grades_run',
       'hit_as_threw', 'interceptions', 'penalties', 'pressure_to_sack_rate',
       'qb_rating', 'sack_percent', 'sacks', 'scrambles', 'spikes',
       'thrown_aways', 'touchdowns', 'turnover_worthy_plays', 'twp_rate',
       'yards', 'ypa'],
      dtype='object')

We want to plot our eigenvalues to get an idea of which vectors tell us the most about our data.

In [1035]:
import matplotlib.pyplot as plt

plt.scatter([i for i in range(len(S))], np.log(S))
plt.title("Eigenvalues")
plt.xlabel("index")
plt.ylabel('log(value)')
Out[1035]:
Text(0, 0.5, 'log(value)')

Now we want to sort the last vector in the Vt matrix in order to vizualise the variability of our features. The larger values indicate stronger variability. With this we must also sort the 3 before that. As seen in the graph above the last 4 dots are growing at a much faster rate than the rest of the dots. We want to visualize those 4 vectors but we also must keep the sorting consistent so we will store all the values in vectors at the same time to not lose order of our features.

In [1036]:
sorted_feature_weights1 = sorted(Vt[cols-1], reverse=True)
sorted_feature_weights2 = []
sorted_feature_weights3 = []
sorted_feature_weights4 = []
sorted_features = []
for i, w in enumerate(sorted_feature_weights1):
    fInd = np.where(Vt[cols-1] == w)[0][0]
    print(f'{i+1}: {list(features.columns)[fInd]}, {w}')
    sorted_features.append(list(features.columns)[fInd])
    sorted_feature_weights2.append(Vt[cols-2][fInd])
    sorted_feature_weights3.append(Vt[cols-3][fInd])
    sorted_feature_weights4.append(Vt[cols-4][fInd])
1: declined_penalties, -5.1902476351690364e-05
2: spikes, -0.0005040851831544245
3: avg_time_to_throw, -0.0007336287756775956
4: twp_rate, -0.0008677507881194244
5: penalties, -0.0010294444872858493
6: btt_rate, -0.0012020187441530163
7: hit_as_threw, -0.001488683637183896
8: sack_percent, -0.0016230606162863943
9: drop_rate, -0.0017654610797555451
10: ypa, -0.0019369059513612257
11: bats, -0.0022662809650424204
12: avg_depth_of_target, -0.00227007044048473
13: interceptions, -0.002789288795920271
14: player_game_count, -0.003960862305946379
15: pressure_to_sack_rate, -0.004807589673502431
16: turnover_worthy_plays, -0.005019104882233476
17: thrown_aways, -0.0058182865248409075
18: drops, -0.006068845224694851
19: touchdowns, -0.006455059194627049
20: big_time_throws, -0.006606178892858496
21: scrambles, -0.007147602683265029
22: sacks, -0.008843311923979863
23: grades_hands_fumble, -0.015984998460563206
24: completion_percent, -0.01703716344222998
25: grades_run, -0.017993910644480807
26: grades_pass, -0.01950372002569472
27: accuracy_percent, -0.019768600696350242
28: grades_offense, -0.020069424039326364
29: qb_rating, -0.024586800717168903
30: def_gen_pressures, -0.04979298030384959
31: first_downs, -0.050217903281132084
32: completions, -0.08555131303500313
33: aimed_passes, -0.12160033610316452
34: dropbacks, -0.14797675341032865
35: yards, -0.9736348714320443

Now we will put everything together in one graph. We will show a scatter diagram from each of the 4 vectors. We will display the values for each feature individually. This will give us a good idea of the variability of the features so that we can understand which ones will be useful when trying to predict salary and which ones will tell us things we already know and don't need to be trained on at the risk of overfitting.

In [1037]:
# scatter plot for all of the vectors
x = [i for i in range(35)]
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(sorted_feature_weights1[::-1], x, color='black')
ax.scatter(sorted_feature_weights2[::-1], x, color='red')
ax.scatter(sorted_feature_weights3[::-1], x, color='blue')
ax.scatter(sorted_feature_weights4[::-1], x, color='green')
ax.set_yticks(range(35))
ax.set_yticklabels(sorted_features[::-1])
plt.xlabel("value")
plt.show()

Now we want to explore some of these features individually. Let's start with yards since in our principal component analysis it showed to be the feature with the most variability. Now we want to see how it correlates with our targets.

In [1038]:
# yards and average salary relationship
plt.cla()
m, b = np.polyfit(passing_df['yards'], passing_df["AAV"],1)
plt.scatter( passing_df['yards'], passing_df["AAV"],)
plt.plot(passing_df['yards'], m*passing_df['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('Yards vs AAV')
Out[1038]:
Text(0.5, 1.0, 'Yards vs AAV')

It's fairly hard to tell what is going on but we do indeed see a positive correlation between the yards and the salary, with the players earning a higher salary generally throwing for more yards than those who aren't. The next notable feature we want to see is dropbacks.

In [1039]:
# AAV and dropbacks relationship
plt.cla()
plt.scatter(passing_df['dropbacks'],passing_df["AAV"])
plt.xlabel('dropbacks')
plt.ylabel('AAV')
plt.title('dropbacks vs AAV')
Out[1039]:
Text(0.5, 1.0, 'dropbacks vs AAV')

It looks surprisingly similar. Let's look at pff grade, which again is a grade made by analysts at pro football focus where they grade every play of the player.

In [1040]:
# AAV and pff grade relationship
plt.cla()
m, b = np.polyfit(passing_df['grades_offense'], passing_df["AAV"],1)
plt.scatter(passing_df['grades_offense'], passing_df["AAV"],)
plt.plot(passing_df['grades_offense'], m*passing_df['grades_offense'] + b, color='red')
plt.xlabel('PFF grade')
plt.ylabel('AAV')
plt.title('PFF grade vs AAV')
plt.show()

What is very noticeable between all of these graphs is the two clusters that appear in the plots. These clusters are separated by salary. The lower cluster represents the players on the rookie contracts while the upper cluster are players who have been paid second contracts, which are significantly higher. When a player is first drafted they are essentially assigned a contract that they play on for 4-5 years before being eligible for a new one. Typically, the rookie contracts are farily cheap, and if the player is good enough they either get extended or sign with another team, typically for a lot more money then they were making in the past. This is a very important thing to distinguish in our data. We will explore this further with a k means analysis which will give us a good idea of the split between our clusters.

In [1041]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler


# Create a dataset by stacking the features
data = passing_df[['AAV', 'yards']]

# Perform k-means clustering
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)

# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Add cluster labels to the DataFrame
passing_df["Cluster"] = labels

# Plot the scatter plot
plt.scatter(passing_df["yards"], passing_df["AAV"], c=labels, cmap='viridis')
plt.scatter(centroids[:, 1], centroids[:, 0], c='red', marker='x', s=100)

# Set labels and title
plt.ylabel('Average Annual Value (AAV)')
plt.xlabel('Yards')
plt.title('K-means Clustering on AAV')
plt.ticklabel_format(style='plain')

# Display the plot
plt.show()

We get a very clear and obvious cluster, and the data within each cluster is vastly different with the yellow cluster being more condensed and the purple cluster being a lot more spread out.

In [1042]:
from sklearn.linear_model import LinearRegression

data = passing_df[["yards", "AAV", "Cluster"]]

# Split the DataFrame into clusters
cluster_0_df = passing_df[passing_df["Cluster"] == 0]
cluster_1_df = passing_df[passing_df["Cluster"] == 1]

# Perform linear regression within each cluster
linear_regression_0 = LinearRegression()
linear_regression_0.fit(cluster_0_df[["yards"]], cluster_0_df["AAV"])

linear_regression_1 = LinearRegression()
linear_regression_1.fit(cluster_1_df[["yards"]], cluster_1_df["AAV"])

# Print the coefficients for each cluster
print("Cluster 0: Yards Run Coefficient =", linear_regression_0.coef_)
print("Cluster 1: Yards Run Coefficient =", linear_regression_1.coef_)
Cluster 0: Yards Run Coefficient = [598.48838876]
Cluster 1: Yards Run Coefficient = [101.41286478]
In [1043]:
import seaborn as sns

# Create a violin plot
sns.violinplot(x="Cluster", y="yards", data=data)

# Set labels and title
plt.xlabel('Cluster')
plt.ylabel('Yards Run')
plt.title('Yards Run Distribution by Cluster')

plt.show()

The above violin plot shows that the above 20 million players have a much higher yards run rate than the under 20 million players. This could be explained by how often players are allowed to play. Big name players get paid more, and get more opportunities to run more yards than lower paid players.

In [1044]:
data = passing_df[["yards", "AAV", "Cluster", "player_game_count"]]

# Calculate yards per game
data["Yards Per Game"] = data["yards"] / data["player_game_count"]

# Create a violin plot
sns.violinplot(x="Cluster", y="Yards Per Game", data=data)

# Set labels and title
plt.xlabel('Cluster')
plt.ylabel('Yards Per Game')
plt.title('Yards Per Game Distribution by Cluster')

plt.show()

This gives us a better picture of how our clusters are distributed. In our left cluster we have more datapoints focused within one area at around 250 yards per game. Our other cluster is a tad more spread with the bulk of datapoints lying around 200 yards per game. We can clearly see there is a correlation between getting paid more and throwing for more yards. But does this mean that we can correlate well within our clusters to find more insightful discoveries? Let's explore further. First let's average out our quarterbacks stats rather than have them seperate by year in order to get a more general picture.

In [1045]:
# create one row per player with their average for each feature
grouped_qbs = passing_df.groupby('player')[numeric_cols].mean()
grouped_qbs
Out[1045]:
player_id player_game_count accuracy_percent aimed_passes attempts avg_depth_of_target avg_time_to_throw bats big_time_throws btt_rate ... Rank Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year
player
Aaron Rodgers 2241.0 15.500000 76.533333 485.500000 530.500000 8.416667 2.763333 3.833333 34.166667 5.850000 ... 11.0 38.0 3.0 150815000.0 50271667.0 40800000.0 101515000.0 150815000.0 2022.0 2024.0
Andy Dalton 6187.0 12.166667 73.333333 359.166667 389.500000 8.300000 2.525000 7.833333 16.833333 3.950000 ... 492.0 35.0 2.0 10000000.0 5000000.0 4835000.0 8000000.0 8000000.0 2023.0 2024.0
Baker Mayfield 46518.0 14.800000 72.260000 427.800000 466.000000 8.780000 2.806000 11.800000 26.200000 5.040000 ... 986.0 27.0 1.0 4000000.0 4000000.0 2875000.0 4000000.0 4000000.0 2023.0 2023.0
Blaine Gabbert 6162.0 5.500000 66.050000 129.500000 136.000000 8.750000 2.435000 3.500000 5.000000 3.300000 ... 1832.0 33.0 1.0 1317500.0 1317500.0 152500.0 1092500.0 1092500.0 2023.0 2023.0
Brian Hoyer 5277.0 8.000000 68.800000 202.000000 211.000000 8.400000 2.390000 3.000000 9.000000 4.100000 ... 836.0 37.0 2.0 4500000.0 2250000.0 875000.0 4205000.0 4205000.0 2023.0 2024.0
Brock Purdy 77672.0 10.000000 77.400000 212.000000 233.000000 7.800000 2.840000 7.000000 5.000000 2.000000 ... 1164.0 22.0 4.0 3737008.0 934252.0 77008.0 77008.0 77008.0 2022.0 2025.0
C.J. Beathard 11859.0 6.000000 73.566667 150.333333 165.666667 7.933333 2.550000 5.333333 5.000000 2.933333 ... 836.0 29.0 2.0 4500000.0 2250000.0 1000000.0 2000000.0 2000000.0 2023.0 2024.0
Case Keenum 7323.0 14.000000 74.800000 430.333333 467.333333 7.766667 2.626667 6.000000 18.666667 3.766667 ... 634.0 35.0 2.0 6250000.0 3125000.0 750000.0 4000000.0 4000000.0 2023.0 2024.0
Colt McCoy 5609.0 4.000000 75.800000 128.000000 132.000000 6.400000 2.500000 1.000000 5.000000 3.500000 ... 584.0 35.0 2.0 7500000.0 3750000.0 2500000.0 6000000.0 6000000.0 2022.0 2023.0
Cooper Rush 12324.0 8.000000 69.400000 144.000000 162.000000 8.900000 2.430000 7.000000 3.000000 1.800000 ... 753.0 29.0 2.0 5000000.0 2500000.0 1250000.0 2750000.0 2750000.0 2023.0 2024.0
Dak Prescott 10769.0 14.333333 75.900000 471.666667 500.333333 8.616667 2.743333 6.500000 22.333333 4.250000 ... 8.0 27.0 4.0 160000000.0 40000000.0 66000000.0 95000000.0 126000000.0 2021.0 2024.0
Daniel Jones 39395.0 14.000000 75.425000 413.500000 450.500000 7.550000 2.812500 8.000000 14.750000 3.075000 ... 8.0 25.0 4.0 160000000.0 40000000.0 36000000.0 81000000.0 92000000.0 2023.0 2026.0
David Blough 46508.0 5.000000 68.600000 159.000000 174.000000 8.900000 2.890000 6.000000 7.000000 3.700000 ... 1895.0 27.0 1.0 1232500.0 1232500.0 76250.0 76250.0 76250.0 2023.0 2023.0
Davis Mills 52269.0 14.000000 72.250000 402.500000 436.500000 8.100000 2.615000 10.000000 18.500000 3.950000 ... 714.0 22.0 4.0 5217531.0 1304383.0 1157892.0 1157892.0 1157892.0 2021.0 2024.0
Derek Carr 8671.0 16.000000 76.716667 506.000000 546.666667 8.166667 2.656667 8.000000 26.500000 4.533333 ... 12.0 31.0 4.0 150000000.0 37500000.0 28500000.0 60000000.0 100000000.0 2023.0 2026.0
Deshaun Watson 11767.0 12.600000 73.740000 383.000000 408.800000 9.360000 3.094000 4.400000 23.800000 5.220000 ... 7.0 26.0 5.0 230000000.0 46000000.0 44965000.0 230000000.0 230000000.0 2022.0 2026.0
Desmond Ridder 55086.0 4.000000 68.500000 108.000000 115.000000 8.400000 3.000000 1.000000 3.000000 2.500000 ... 697.0 22.0 4.0 5362959.0 1340740.0 1080336.0 1080336.0 1080336.0 2022.0 2025.0
Drew Lock 39517.0 8.000000 70.300000 219.666667 236.666667 9.066667 2.820000 3.666667 13.333333 4.500000 ... 986.0 26.0 1.0 4000000.0 4000000.0 1750000.0 1750000.0 1750000.0 2023.0 2023.0
Gardner Minshew 41401.0 11.500000 73.200000 366.000000 398.500000 8.200000 2.840000 5.000000 17.000000 4.150000 ... 1233.0 26.0 1.0 3500000.0 3500000.0 0.0 3500000.0 3500000.0 2023.0 2023.0
Geno Smith 7820.0 18.000000 77.700000 565.000000 607.000000 8.300000 2.790000 8.000000 35.000000 5.400000 ... 63.0 32.0 3.0 75000000.0 25000000.0 0.0 27300000.0 40000000.0 2023.0 2025.0
Jacoby Brissett 10725.0 13.000000 74.100000 343.500000 377.250000 8.400000 2.907500 4.750000 13.250000 3.375000 ... 561.0 30.0 1.0 8000000.0 8000000.0 4500000.0 7500000.0 7500000.0 2023.0 2023.0
Jalen Hurts 40291.0 14.666667 71.633333 351.333333 390.000000 9.433333 3.136667 9.333333 19.333333 4.633333 ... 4.0 24.0 5.0 255000000.0 51000000.0 23294000.0 110000000.0 179399000.0 2023.0 2028.0
Jameis Winston 9434.0 10.000000 71.100000 321.400000 344.400000 10.940000 2.818000 4.600000 17.200000 5.120000 ... 986.0 29.0 1.0 4000000.0 4000000.0 2835000.0 4000000.0 4000000.0 2023.0 2023.0
Jared Goff 10635.0 16.500000 75.483333 528.833333 582.333333 7.850000 2.720000 11.333333 23.666667 3.783333 ... 17.0 24.0 4.0 134000000.0 33500000.0 25000000.0 57042682.0 110042682.0 2019.0 2024.0
Jimmy Garoppolo 8697.0 12.000000 75.900000 316.200000 334.800000 7.540000 2.566000 6.800000 9.600000 2.760000 ... 69.0 31.0 3.0 72750000.0 24250000.0 11250000.0 33750000.0 45000000.0 2023.0 2025.0
Joe Burrow 28022.0 16.333333 76.866667 553.333333 593.666667 8.233333 2.573333 12.666667 30.333333 4.700000 ... 185.0 23.0 4.0 36190137.0 9047534.0 23880100.0 36190137.0 36190137.0 2020.0 2024.0
Joe Flacco 4332.0 8.600000 73.400000 272.600000 303.000000 8.620000 2.614000 7.200000 10.200000 3.400000 ... 1233.0 36.0 1.0 3500000.0 3500000.0 1730000.0 2850000.0 2850000.0 2022.0 2022.0
Josh Allen 46601.0 17.000000 72.600000 528.800000 575.000000 9.960000 2.992000 10.400000 34.600000 5.460000 ... 135.5 23.0 5.0 140388319.0 24345663.5 15530050.0 61390617.0 86371319.0 2020.0 2025.5
Justin Fields 82118.0 13.500000 69.300000 271.500000 294.000000 9.800000 3.255000 6.500000 15.500000 4.950000 ... 327.0 22.0 4.0 18871957.0 4717989.0 11085060.0 18871957.0 18871957.0 2021.0 2025.0
Justin Herbert 28237.0 16.666667 77.233333 616.666667 670.000000 7.533333 2.690000 17.333333 27.333333 3.933333 ... 243.0 22.0 4.0 26578755.0 6644689.0 16890004.0 26578755.0 26578755.0 2020.0 2024.0
Kenny Pickett 60515.0 13.000000 73.300000 360.000000 389.000000 8.000000 2.990000 8.000000 18.000000 4.300000 ... 393.0 23.0 4.0 14067905.0 3516976.0 7411204.0 14067905.0 14067905.0 2022.0 2026.0
Kirk Cousins 7102.0 16.500000 77.850000 520.833333 568.166667 8.250000 2.695000 11.333333 26.333333 4.366667 ... 191.0 33.0 1.0 35000000.0 35000000.0 25000000.0 35000000.0 35000000.0 2022.0 2023.0
Kyle Allen 26152.0 13.000000 73.300000 450.000000 489.000000 8.600000 2.690000 7.000000 20.000000 3.900000 ... 1895.0 27.0 1.0 1232500.0 1232500.0 100000.0 350000.0 350000.0 2023.0 2023.0
Kyler Murray 38334.0 14.500000 76.025000 458.500000 501.000000 7.825000 2.737500 10.250000 25.500000 4.675000 ... 6.0 24.0 5.0 230500000.0 46100000.0 29035000.0 103300000.0 159797000.0 2022.0 2028.0
Lamar Jackson 46416.0 14.000000 72.860000 331.400000 358.200000 9.380000 3.110000 7.800000 18.400000 4.720000 ... 2.0 26.0 5.0 260000000.0 52000000.0 72500000.0 135000000.0 185000000.0 2023.0 2027.0
Mac Jones 60323.0 16.000000 74.650000 468.500000 500.500000 8.250000 2.655000 8.500000 24.000000 4.550000 ... 365.0 22.0 4.0 15586352.0 3896588.0 8695528.0 15586352.0 15586352.0 2021.0 2025.0
Marcus Mariota 9435.0 13.000000 72.175000 309.000000 328.250000 8.875000 2.875000 3.500000 14.250000 3.675000 ... 753.0 29.0 1.0 5000000.0 5000000.0 3835000.0 5000000.0 5000000.0 2023.0 2023.0
Mason Rudolph 46569.0 10.000000 74.500000 259.000000 283.000000 8.100000 2.740000 3.000000 6.000000 2.000000 ... 737.0 25.0 1.0 5080000.0 5080000.0 2080000.0 2080000.0 2080000.0 2021.0 2022.0
Matthew Stafford 4924.0 14.333333 74.033333 471.166667 497.166667 8.766667 2.598333 7.666667 30.000000 5.800000 ... 8.0 34.0 4.0 160000000.0 40000000.0 60000000.0 63000000.0 130000000.0 2022.0 2026.0
Mike White 46453.0 4.000000 73.000000 140.500000 153.500000 7.200000 2.565000 5.000000 5.500000 3.300000 ... 561.0 27.0 2.0 8000000.0 4000000.0 3420000.0 4500000.0 4500000.0 2023.0 2024.0
Mitchell Trubisky 11757.0 11.800000 71.300000 348.200000 365.800000 8.840000 2.714000 3.600000 16.400000 4.380000 ... 388.0 27.0 2.0 14285000.0 7142500.0 5250000.0 5250000.0 5250000.0 2022.0 2023.0
Nick Mullens 12282.0 9.000000 76.900000 269.000000 300.000000 7.200000 2.515000 9.500000 7.500000 2.300000 ... 986.0 27.0 2.0 4000000.0 2000000.0 750000.0 1922500.0 1922500.0 2023.0 2024.0
Patrick Mahomes 11765.0 18.600000 77.860000 638.600000 696.000000 8.300000 2.854000 8.200000 41.000000 5.560000 ... 1.0 24.0 10.0 450000000.0 45000000.0 10000000.0 63081905.0 141481905.0 2020.0 2031.0
Russell Wilson 7077.0 16.166667 75.250000 466.500000 508.833333 9.850000 2.966667 7.833333 37.000000 6.750000 ... 5.0 33.0 5.0 242588236.0 48517647.0 50000000.0 124000000.0 161000000.0 2022.0 2028.0
Ryan Tannehill 7014.0 14.400000 75.800000 372.000000 401.400000 8.720000 2.688000 8.800000 16.800000 3.920000 ... 21.0 31.0 4.0 118000000.0 29500000.0 20000000.0 62000000.0 91000000.0 2020.0 2023.0
Sam Darnold 29048.0 11.200000 72.480000 315.400000 353.000000 8.700000 2.896000 5.600000 13.800000 3.940000 ... 836.0 25.0 1.0 4500000.0 4500000.0 1500000.0 3500000.0 3500000.0 2023.0 2023.0
Sam Ehlinger 60555.0 4.000000 71.100000 97.000000 101.000000 7.800000 3.020000 2.000000 3.000000 2.900000 ... 1201.0 22.0 4.0 3610708.0 902677.0 130708.0 130708.0 130708.0 2021.0 2024.0
Skylar Thompson 60417.0 8.000000 64.400000 135.000000 150.000000 9.000000 3.040000 1.000000 4.000000 2.500000 ... 1159.0 24.0 4.0 3742216.0 935554.0 82216.0 82216.0 82216.0 2022.0 2025.0
Taylor Heinicke 9896.0 12.500000 72.150000 347.500000 376.500000 8.700000 2.945000 7.500000 13.500000 3.400000 ... 398.0 29.0 2.0 14000000.0 7000000.0 4000000.0 6320000.0 6320000.0 2023.0 2024.0
Taysom Hill 12112.0 12.000000 77.900000 119.000000 127.500000 7.700000 2.970000 1.000000 2.500000 1.900000 ... 163.0 31.0 4.0 40000000.0 10000000.0 0.0 10100000.0 21500000.0 2021.0 2025.0
Teddy Bridgewater 8667.0 12.000000 79.000000 342.666667 371.333333 7.366667 2.770000 7.333333 12.333333 3.200000 ... 621.0 29.0 1.0 6500000.0 6500000.0 2000000.0 6500000.0 6500000.0 2022.0 2022.0
Trevor Lawrence 77632.0 18.000000 74.350000 585.500000 636.000000 8.050000 2.670000 13.000000 28.500000 4.200000 ... 182.0 21.0 4.0 36793488.0 9198372.0 24118900.0 27598900.0 36793488.0 2021.0 2025.0
Tua Tagovailoa 60326.0 12.000000 74.566667 341.000000 359.333333 8.500000 2.523333 2.666667 11.666667 3.000000 ... 221.0 22.0 4.0 30275438.0 7568860.0 19578500.0 30275438.0 30275438.0 2020.0 2024.0
Tyler Huntley 46448.0 6.000000 73.350000 152.000000 164.500000 7.750000 2.905000 4.000000 3.000000 1.750000 ... 1341.0 25.0 1.0 2627000.0 2627000.0 0.0 0.0 0.0 2023.0 2023.0
Tyrod Taylor 6332.0 11.000000 70.950000 280.000000 303.500000 8.150000 2.990000 8.500000 10.000000 2.900000 ... 480.0 29.0 2.0 11000000.0 5500000.0 4200000.0 8170000.0 8170000.0 2022.0 2023.0
Zach Wilson 82096.0 11.000000 67.800000 279.500000 312.000000 8.850000 3.060000 8.000000 9.000000 2.850000 ... 190.0 21.0 4.0 35150681.0 8787670.0 22924132.0 35150681.0 35150681.0 2021.0 2025.0

56 rows × 50 columns

Now we want to see the relationship between a players average yards and their salary.

In [1046]:
plt.cla()
m, b = np.polyfit(grouped_qbs['yards'],grouped_qbs['AAV'] , 1)
plt.scatter(grouped_qbs['yards'], grouped_qbs['AAV'],)
plt.plot(grouped_qbs['yards'], grouped_qbs['yards']*m + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('Player average yards vs AAV')
Out[1046]:
Text(0.5, 1.0, 'Player average yards vs AAV')

Now let's analyze our clusters. Let's start with the less than 20 million cluster.

In [1047]:
plt.cla()
# lower salary cluster
m, b = np.polyfit(cluster_1_df['yards'], cluster_1_df['AAV'], 1)
plt.scatter(cluster_1_df['yards'], cluster_1_df['AAV'])
plt.plot(cluster_1_df['yards'], cluster_1_df['yards']*m + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('lower cluster yards VS AAV')
Out[1047]:
Text(0.5, 1.0, 'lower cluster yards VS AAV')

More than 20 million cluster

In [1048]:
plt.cla()
# upper salary cluster
m, b = np.polyfit( cluster_0_df['yards'], cluster_0_df['AAV'], 1)
plt.scatter(cluster_0_df['yards'],cluster_0_df['AAV'],)
plt.plot(cluster_0_df['yards'], cluster_0_df['yards']*m + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('upper cluster yards VS AAV')
Out[1048]:
Text(0.5, 1.0, 'upper cluster yards VS AAV')

Now let's look at the relationship between players before they receive their contract.

In [1049]:
# yards and average salary relationship
plt.cla()
m, b = np.polyfit(passing_df[passing_df['year'] < passing_df['start_year']]['yards'], passing_df[passing_df['year'] < passing_df['start_year']]["AAV"],1)
plt.scatter(passing_df[passing_df['year'] < passing_df['start_year']]['yards'], passing_df[passing_df['year'] < passing_df['start_year']]["AAV"],)
plt.plot(passing_df[passing_df['year'] < passing_df['start_year']]['yards'], m*passing_df[passing_df['year'] < passing_df['start_year']]['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('Yards before contract year VS AAV')
Out[1049]:
Text(0.5, 1.0, 'Yards before contract year VS AAV')

Now let's look at how players perform after they receive their contracts.

In [1050]:
# yards and average salary relationship
plt.cla()
m, b = np.polyfit(passing_df[passing_df['year'] >= passing_df['start_year']]['yards'], passing_df[passing_df['year'] >= passing_df['start_year']]["AAV"],1)
plt.scatter(passing_df[passing_df['year'] >= passing_df['start_year']]['yards'], passing_df[passing_df['year'] >= passing_df['start_year']]["AAV"],)
plt.plot(passing_df[passing_df['year'] >= passing_df['start_year']]['yards'], m*passing_df[passing_df['year'] >= passing_df['start_year']]['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('yards after contract year VS AAV')
Out[1050]:
Text(0.5, 1.0, 'yards after contract year VS AAV')

Now let's explore the upper cluster, the quarterbacks who have received a notable second contract. Understanding these players will help us use our data to figure out how to better predict what contracts should look like.

In [1051]:
# yards before contract and average salary relationship
plt.cla()
m, b = np.polyfit(cluster_0_df[cluster_0_df['year'] < cluster_0_df['start_year']]['yards'], cluster_0_df[cluster_0_df['year'] < cluster_0_df['start_year']]["AAV"],1)
plt.scatter(cluster_0_df[cluster_0_df['year'] < cluster_0_df['start_year']]['yards'], cluster_0_df[cluster_0_df['year'] < cluster_0_df['start_year']]["AAV"],)
plt.plot(cluster_0_df[cluster_0_df['year'] < cluster_0_df['start_year']]['yards'], m*cluster_0_df[cluster_0_df['year'] < cluster_0_df['start_year']]['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('upper cluster yards before contract VS AAV')
Out[1051]:
Text(0.5, 1.0, 'upper cluster yards before contract VS AAV')
In [1052]:
# yards after contract and average salary relationship
plt.cla()
m, b = np.polyfit(cluster_0_df[cluster_0_df['year'] >= cluster_0_df['start_year']]['yards'], cluster_0_df[cluster_0_df['year'] >= cluster_0_df['start_year']]["AAV"],1)
plt.scatter(cluster_0_df[cluster_0_df['year'] >= cluster_0_df['start_year']]['yards'], cluster_0_df[cluster_0_df['year'] >= cluster_0_df['start_year']]["AAV"],)
plt.plot(cluster_0_df[cluster_0_df['year'] >= cluster_0_df['start_year']]['yards'], m*cluster_0_df[cluster_0_df['year'] >= cluster_0_df['start_year']]['yards'] + b, color='red')
plt.xlabel('yards')
plt.ylabel('AAV')
plt.title('upper cluster after contract VS AAV')
Out[1052]:
Text(0.5, 1.0, 'upper cluster after contract VS AAV')
In [1053]:
# pff grade and average salary relationship
plt.cla()
m, b = np.polyfit(cluster_0_df['grades_offense'], cluster_0_df["AAV"],1)
plt.scatter(cluster_0_df['grades_offense'], cluster_0_df["AAV"],)
plt.plot(cluster_0_df['grades_offense'], m*cluster_0_df['grades_offense'] + b, color='red')
plt.xlabel('PFF Grade')
plt.ylabel('AAV')
plt.title('upper cluster after before contract VS AAV')
Out[1053]:
Text(0.5, 1.0, 'upper cluster after before contract VS AAV')
In [1054]:
# big time throws and average salary relationship
plt.cla()
m, b = np.polyfit(cluster_0_df['big_time_throws'], cluster_0_df["AAV"],1)
plt.scatter(cluster_0_df['big_time_throws'], cluster_0_df["AAV"],)
plt.plot(cluster_0_df['big_time_throws'], m*cluster_0_df['big_time_throws'] + b, color='red')
plt.xlabel('big time throws')
plt.ylabel('AAV')
plt.title('upper cluster after before contract VS AAV')
Out[1054]:
Text(0.5, 1.0, 'upper cluster after before contract VS AAV')

There appears to be high correlation between salary and yards for players who are paid less. However, for players that are paid high, there is negiligeable correlation, which means that it is both not strong and not a good predictor for future salary. However, we do see slightly more promise with other features such as big time throws and pff grade. We will now further explore feature selection in order to see if we can select a good amounut of features that will be able to predict salary together. We want to now figure out any features that are redundant with each other or irrelevant to predicting the salary. Understanding our data in this way will allow for cleaner built models that converge quicker, generalize and produce more accurate results more efficiently. We will start by seeing how our features correlate with the AAV and to drop features in which there are none.

In [1055]:
targets = ['Value', 'AAV', 'Sign Bonus', "G'teed @ Sign", "Practical G'teed"]
# features = passing_df.select_dtypes(include=[np.number])
aav_corr = passing_df[list(features.columns) + ['AAV']].corr()['AAV']
abs(aav_corr).sort_values()
Out[1055]:
grades_hands_fumble      0.000380
declined_penalties       0.011715
sack_percent             0.060447
hit_as_threw             0.071293
pressure_to_sack_rate    0.084187
interceptions            0.087623
avg_depth_of_target      0.101817
bats                     0.127425
drop_rate                0.133226
spikes                   0.202030
turnover_worthy_plays    0.223162
grades_run               0.230031
avg_time_to_throw        0.230634
twp_rate                 0.236639
penalties                0.244877
completion_percent       0.288256
accuracy_percent         0.305657
drops                    0.308714
sacks                    0.353610
scrambles                0.365247
btt_rate                 0.381945
ypa                      0.400228
thrown_aways             0.403218
aimed_passes             0.408570
def_gen_pressures        0.424604
dropbacks                0.425085
completions              0.427110
player_game_count        0.432702
first_downs              0.457631
yards                    0.458107
grades_pass              0.459533
big_time_throws          0.469766
grades_offense           0.482950
qb_rating                0.498848
touchdowns               0.503470
AAV                      1.000000
Name: AAV, dtype: float64

The cutoff we will decide on is spikes, which based on intuition about football has very little to do with analying performance of a player. Everything with less correlation to AAV than that will also be dropped.

In [1056]:
aav_corr = aav_corr[aav_corr > 0.21]
features = features[aav_corr.drop('AAV').index]
aav_corr
Out[1056]:
player_game_count        0.432702
accuracy_percent         0.305657
aimed_passes             0.408570
avg_time_to_throw        0.230634
big_time_throws          0.469766
btt_rate                 0.381945
completion_percent       0.288256
completions              0.427110
def_gen_pressures        0.424604
dropbacks                0.425085
drops                    0.308714
first_downs              0.457631
grades_offense           0.482950
grades_pass              0.459533
grades_run               0.230031
penalties                0.244877
qb_rating                0.498848
sacks                    0.353610
scrambles                0.365247
thrown_aways             0.403218
touchdowns               0.503470
turnover_worthy_plays    0.223162
yards                    0.458107
ypa                      0.400228
AAV                      1.000000
Name: AAV, dtype: float64

Now we have reduced our feature set a little bit. Now we want to remove features that are redundant with each other. To do this we will use the pandas scatter matrix plotting function to visualise this intuitively. What this does is that it will plot the correlation between every feature individually. With this we can see features that are highly correlated and therefore redundant with each other.

In [1057]:
pd.plotting.scatter_matrix(features[aav_corr.drop('AAV').index], diagonal='kde', figsize=(30, 30))
for ax in plt.gcf().axes:
    ax.tick_params(labelsize=12)
    ax.xaxis.label.set_fontsize(7)
    ax.yaxis.label.set_fontsize(7)
plt.show()

From this graph we can observe a few datapoints to drop. First off aimed passes is a highly redundant feature so we will drop it. Other features are similar, including completions, dropbacks, first downs, big time throws and yards. Because big time throws is the most correlated with AAV we will keep that one and drop the others. Offense and passing grades are also highly redundant as passing grade is a subset of the offensive grade, we will drop this. We can see some correlation in other areas but we will leave those for now. This leaves us with now 18 features.

In [1058]:
features = features.drop(['aimed_passes', 'completions', 'dropbacks', 'first_downs', 'grades_pass', 'yards'], axis=1)
corr_matrix = features.corr()
corr_matrix
Out[1058]:
player_game_count accuracy_percent avg_time_to_throw big_time_throws btt_rate completion_percent def_gen_pressures drops grades_offense grades_run penalties qb_rating sacks scrambles thrown_aways touchdowns turnover_worthy_plays ypa
player_game_count 1.000000 0.437641 0.099989 0.772411 0.351008 0.371896 0.888488 0.794328 0.621176 0.282298 0.523688 0.482449 0.744253 0.577548 0.742027 0.816649 0.706002 0.356817
accuracy_percent 0.437641 1.000000 -0.239002 0.357308 0.071976 0.847651 0.361914 0.380525 0.589884 0.134533 0.184160 0.672444 0.282167 0.122394 0.341557 0.493866 0.126579 0.432872
avg_time_to_throw 0.099989 -0.239002 1.000000 0.077914 0.174100 -0.307968 0.186844 -0.024387 0.056687 0.331428 0.135266 -0.031181 0.175699 0.583485 0.202739 0.030019 0.020522 0.008194
big_time_throws 0.772411 0.357308 0.077914 1.000000 0.787101 0.350747 0.781584 0.686443 0.745718 0.290445 0.400217 0.569391 0.569285 0.522322 0.668408 0.877429 0.589140 0.455350
btt_rate 0.351008 0.071976 0.174100 0.787101 1.000000 0.146743 0.385275 0.281415 0.590632 0.235507 0.218078 0.477410 0.280548 0.335089 0.327626 0.561176 0.250768 0.455223
completion_percent 0.371896 0.847651 -0.307968 0.350747 0.146743 1.000000 0.284873 0.164711 0.555226 0.075198 0.143891 0.727730 0.242659 0.092600 0.140309 0.475021 0.111345 0.507245
def_gen_pressures 0.888488 0.361914 0.186844 0.781584 0.385275 0.284873 1.000000 0.770062 0.565647 0.297715 0.439443 0.396939 0.799194 0.668934 0.751959 0.781361 0.735876 0.278926
drops 0.794328 0.380525 -0.024387 0.686443 0.281415 0.164711 0.770062 1.000000 0.513224 0.222321 0.348474 0.266030 0.563719 0.448961 0.647000 0.688533 0.661611 0.156339
grades_offense 0.621176 0.589884 0.056687 0.745718 0.590632 0.555226 0.565647 0.513224 1.000000 0.441164 0.275276 0.740568 0.392531 0.436743 0.475274 0.752588 0.236734 0.607530
grades_run 0.282298 0.134533 0.331428 0.290445 0.235507 0.075198 0.297715 0.222321 0.441164 1.000000 0.142861 0.215324 0.239035 0.528501 0.206591 0.305779 0.182653 0.168180
penalties 0.523688 0.184160 0.135266 0.400217 0.218078 0.143891 0.439443 0.348474 0.275276 0.142861 1.000000 0.180349 0.443809 0.311807 0.494594 0.387091 0.376580 0.116272
qb_rating 0.482449 0.672444 -0.031181 0.569391 0.477410 0.727730 0.396939 0.266030 0.740568 0.215324 0.180349 1.000000 0.275732 0.254520 0.323622 0.732861 0.167743 0.795439
sacks 0.744253 0.282167 0.175699 0.569285 0.280548 0.242659 0.799194 0.563719 0.392531 0.239035 0.443809 0.275732 1.000000 0.571604 0.547841 0.530947 0.575722 0.227301
scrambles 0.577548 0.122394 0.583485 0.522322 0.335089 0.092600 0.668934 0.448961 0.436743 0.528501 0.311807 0.254520 0.571604 1.000000 0.502622 0.507499 0.449490 0.174856
thrown_aways 0.742027 0.341557 0.202739 0.668408 0.327626 0.140309 0.751959 0.647000 0.475274 0.206591 0.494594 0.323622 0.547841 0.502622 1.000000 0.665953 0.529567 0.175107
touchdowns 0.816649 0.493866 0.030019 0.877429 0.561176 0.475021 0.781361 0.688533 0.752588 0.305779 0.387091 0.732861 0.530947 0.507499 0.665953 1.000000 0.607125 0.555451
turnover_worthy_plays 0.706002 0.126579 0.020522 0.589140 0.250768 0.111345 0.735876 0.661611 0.236734 0.182653 0.376580 0.167743 0.575722 0.449490 0.529567 0.607125 1.000000 0.192064
ypa 0.356817 0.432872 0.008194 0.455350 0.455223 0.507245 0.278926 0.156339 0.607530 0.168180 0.116272 0.795439 0.227301 0.174856 0.175107 0.555451 0.192064 1.000000
In [1059]:
features.columns
Out[1059]:
Index(['player_game_count', 'accuracy_percent', 'avg_time_to_throw',
       'big_time_throws', 'btt_rate', 'completion_percent',
       'def_gen_pressures', 'drops', 'grades_offense', 'grades_run',
       'penalties', 'qb_rating', 'sacks', 'scrambles', 'thrown_aways',
       'touchdowns', 'turnover_worthy_plays', 'ypa'],
      dtype='object')
In [1060]:
pd.plotting.scatter_matrix(features[corr_matrix.index], diagonal='kde', figsize=(30, 30))
for ax in plt.gcf().axes:
    ax.tick_params(labelsize=12)
    ax.xaxis.label.set_fontsize(7)
    ax.yaxis.label.set_fontsize(7)
plt.show()

Nothing seems blatenly correlated. Now we want to test how many features are considered valueable for our dataset. For this we will do a sequential feature selection and test out how effective the number of features are on a linear regression model. The SequencialFeatureSelection from sklearn is a greedy algorithm which will recursively select the next best feature for the model and go up to the specified amount. We looped over all possible number of features to see if it converged early so that we could possibly remove some features.

In [1061]:
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Create a LinearRegression model
lr = LinearRegression()

X_train, X_test, y_train, y_test = train_test_split(features, passing_df['AAV'], test_size=0.2, random_state=42)

for i in range(1, len(features.columns)):
    # Create a SequentialFeatureSelector
    sfs = SequentialFeatureSelector(lr, n_features_to_select=i, direction='forward')
    # Fit the SequentialFeatureSelector on the data
    sfs.fit(X_train, y_train)
    # Get the selected feature indices
    selected_features = sfs.get_support(indices=True)
    selected_feature_names = list(features.columns[selected_features])
    lr.fit(X_train[selected_feature_names], y_train)
    r_squared = lr.score(X_test[selected_feature_names], y_test)
    print(f'{i} features, R-squared: {r_squared}')  
1 features, R-squared: 0.244606316577741
2 features, R-squared: 0.09224184085127551
3 features, R-squared: 0.1254399814209034
4 features, R-squared: 0.12996871273586075
5 features, R-squared: 0.1220496126531656
6 features, R-squared: 0.1192598219799419
7 features, R-squared: 0.1291698097568209
8 features, R-squared: 0.11305164951905311
9 features, R-squared: 0.11170548528584112
10 features, R-squared: 0.08477864360019127
11 features, R-squared: 0.08356644818467918
12 features, R-squared: 0.08352904118312476
13 features, R-squared: 0.07484603528129596
14 features, R-squared: 0.05595573976768642
15 features, R-squared: 0.05740324208111103
16 features, R-squared: 0.03812686600087556
17 features, R-squared: 0.03288428094003615

There is a big jump from 15 to 16 features but there is not as big of one from 16 to 17, so we will move forward with the top 16 features.

In [1062]:
sfs = SequentialFeatureSelector(lr, n_features_to_select=16, direction='forward')
sfs.fit(X_train, y_train)
# Get the selected feature indices
selected_features = sfs.get_support(indices=True)
selected_feature_names = list(features.columns[selected_features]) + ['start_year']
features = passing_df[selected_feature_names]
features.columns
Out[1062]:
Index(['player_game_count', 'accuracy_percent', 'avg_time_to_throw',
       'big_time_throws', 'btt_rate', 'completion_percent', 'drops',
       'grades_offense', 'grades_run', 'penalties', 'qb_rating', 'sacks',
       'scrambles', 'thrown_aways', 'turnover_worthy_plays', 'ypa',
       'start_year'],
      dtype='object')

Now we will do some machine learning to create a model that can predict salary. We will use some of the observations we used from our exploration analysis to experiment with multiple models. For the most part, we will be using neural networks. We will be using tensorflow, a python library that creates neural networks and performs calculations into a model. It allows us to specify how many hidden layers and nodes we have. The finer details are handeled on the back end like the weight adjustments, which will determine how we get from input to output. For our model, we will use 1 output since our targets are already closely correlated. We will use relu activation function which will not allow us to have negative values ane we will normalize all our data. We will train on 80% of our data and hold out 20% for testing. We will have 2 hidden layers.

In [1063]:
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from sklearn.metrics import r2_score

# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, passing_df[['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(31, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')
In [1064]:
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
In [1065]:
from sklearn.metrics import r2_score
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
2/2 [==============================] - 0s 17ms/step
R^2 score: -0.27057328396322666
In [1066]:
plt.cla()
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('QB NN')
plt.xlabel('test data')
plt.ylabel('predicted data')
Out[1066]:
Text(0, 0.5, 'predicted data')

Let's now train on how players perform before their contract.

In [1067]:
passing_df[passing_df['year'] < passing_df['start_year']]
features = passing_df[passing_df['start_year'] > passing_df['year']][passing_df['dropbacks'] > 300][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, passing_df[passing_df['start_year'] > passing_df['year']][passing_df['dropbacks'] > 300][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(31, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')
In [1068]:
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
In [1069]:
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
1/1 [==============================] - 0s 48ms/step
R^2 score: -0.1760114269782802
In [1070]:
plt.cla()
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('QB NN before contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
Out[1070]:
Text(0, 0.5, 'predicted data')
In [1071]:
passing_df[passing_df['year'] < passing_df['start_year']]
features = cluster_1_df[passing_df['dropbacks'] > 300][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[passing_df['dropbacks'] > 300][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(31, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')
In [1072]:
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
In [1073]:
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
1/1 [==============================] - 0s 48ms/step
R^2 score: 0.6845601843548529
In [1074]:
plt.cla()
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('Upper cluster QB NN')
plt.xlabel('test data')
plt.ylabel('predicted data')
Out[1074]:
Text(0, 0.5, 'predicted data')

Now lets train on our upper cluster before they receive their contracts.

In [1075]:
features = cluster_1_df[passing_df['start_year'] > passing_df['year']][passing_df['dropbacks'] > 300][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[passing_df['start_year'] > passing_df['year']][passing_df['dropbacks'] > 300][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(31, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')
In [1076]:
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
In [1077]:
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score (test):', r2)
1/1 [==============================] - 0s 48ms/step
R^2 score (test): 0.6683308312916421
In [1078]:
nn = model
In [1079]:
plt.cla()
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('QB NN before contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
Out[1079]:
Text(0, 0.5, 'predicted data')

Lets train on quarterbacks after they receive their contracts.

In [1080]:
passing_df[passing_df['year'] < passing_df['start_year']]
features = passing_df[passing_df['start_year'] >= passing_df['year']][passing_df['dropbacks'] > 300][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, passing_df[passing_df['start_year'] >= passing_df['year']][passing_df['dropbacks'] > 300][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(31, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')
In [1081]:
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
In [1082]:
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
1/1 [==============================] - 0s 49ms/step
R^2 score: 0.21536589007912033
In [1083]:
plt.plot(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred), 'o')
plt.title('QB NN after contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
plt.show()
In [1084]:
passing_df[passing_df['year'] < passing_df['start_year']]
features = cluster_1_df[passing_df['start_year'] <= passing_df['year']][selected_feature_names]
# get training and testing data
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[passing_df['start_year'] <= passing_df['year']][['AAV']], test_size=0.2, random_state=42)
# normalize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(31, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(optimizer='adam', loss='mean_squared_error')
In [1085]:
history = model.fit(X_train_scaled, y_train, validation_data=(X_test_scaled, y_test), epochs=100, batch_size=32, verbose=False)
In [1086]:
y_pred = model.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
print('R^2 score:', r2)
1/1 [==============================] - 0s 51ms/step
R^2 score: -0.17128032098927815
In [1087]:
plt.scatter(scaler.inverse_transform(y_test), scaler.inverse_transform(y_pred))
plt.title('QB NN after contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
plt.show()

There simply isn't enough data to make any conclusions from here. From our neural network we found that both our upper cluster and players before their contract offer more accurate training data. The most accurate we got was the model where we trained on the cluster of quarterbacks that received their second contracts and look at how they played before their contracts. This proved to be a fairly accurate model in terms of being able to predict salary on our testing data. Now lets try a couple of other machine learning models starting with K-nearest neighbors. This is a model which takes a new data point and looks at the points around it that are known and makes a prediction based off of them. Sklearn has a function that allows us to do this while specifying the amount of neighbors we want to observe.

In [1088]:
from sklearn.neighbors import KNeighborsRegressor

features = passing_df[selected_feature_names]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, passing_df[['AAV']], test_size=0.2, random_state=42)

# Create KNN regressor
knn = KNeighborsRegressor(n_neighbors=20)

# Fit the model on the training data
knn.fit(X_train, y_train)

# Predict on the test data
y_pred = knn.predict(X_test)

# Evaluate the model
print("R^2 Score:", r2_score(y_test, y_pred))
R^2 Score: 0.2716735737097594
In [1089]:
features = cluster_1_df[passing_df['start_year'] > passing_df['year']][passing_df['dropbacks'] > 300][selected_feature_names]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[passing_df['start_year'] > passing_df['year']][passing_df['dropbacks'] > 300][['AAV']], test_size=0.2, random_state=42)

# Create KNN regressor
knn = KNeighborsRegressor(n_neighbors=5)

# Fit the model on the training data
knn.fit(X_train, y_train)

# Predict on the test data
y_pred = knn.predict(X_test)

# Evaluate the model
print("R^2 Score:", r2_score(y_test, y_pred))
R^2 Score: 0.687334875311706
In [1090]:
plt.plot(y_test, y_pred, 'o')
plt.title('QB NN KNN contract year')
plt.xlabel('test data')
plt.ylabel('predicted data')
plt.show()

This is comparable to our neural net but still a pretty good model in terms of what we're looking for. Now let's try linear regression.

In [1091]:
features = passing_df[selected_feature_names]
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, passing_df['AAV'], test_size=0.2, random_state=42)

# Create a linear regression model
lr = LinearRegression()

# Fit the model to the training data
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
print("R^2 Score:", r2_score(y_test, y_pred))
R^2 Score: 0.00616578919961297
In [1092]:
features = cluster_1_df[passing_df['start_year'] > passing_df['year']][selected_feature_names]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, cluster_1_df[passing_df['start_year'] > passing_df['year']][['AAV']], test_size=0.2, random_state=42)

# Create a linear regression model
lr = LinearRegression()

# Fit the model to the training data
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)
print("R^2 Score:", r2_score(y_test, y_pred))
R^2 Score: 0.15657887317565966
In [1093]:
plt.plot(y_test, y_pred, 'o')
plt.title('QB NN KNN contract year linear regression')
plt.xlabel('test data')
plt.ylabel('predicted data')
plt.show()

This is clearly not as good as our KNN or neural network. Now let's use our models to predict the average salary of the next two quarterbacks anticipated to get massive contracts, Joe Burrow and Justin Herbert.

In [1094]:
justin_herbert = passing_df[passing_df['player'] == 'Justin Herbert']
justin_herbert
Out[1094]:
player player_id position team_name player_game_count accuracy_percent aimed_passes attempts avg_depth_of_target avg_time_to_throw ... Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year Cluster
5 Justin Herbert 28237 QB LAC 18 78.8 673 743 6.9 2.74 ... 22 4 26578755 6644689 16890004 26578755 26578755 2020 2024 0
6 Justin Herbert 28237 QB LAC 17 77.1 624 672 7.9 2.72 ... 22 4 26578755 6644689 16890004 26578755 26578755 2020 2024 0
7 Justin Herbert 28237 QB LAC 15 75.8 553 595 7.8 2.61 ... 22 4 26578755 6644689 16890004 26578755 26578755 2020 2024 0

3 rows × 54 columns

In [1095]:
joe_burrow = passing_df[passing_df['player'] == 'Joe Burrow']
joe_burrow
Out[1095]:
player player_id position team_name player_game_count accuracy_percent aimed_passes attempts avg_depth_of_target avg_time_to_throw ... Signed Age Yrs Value AAV Sign Bonus G'teed @ Sign Practical G'teed start_year end_year Cluster
8 Joe Burrow 28022 QB CIN 19 77.3 660 715 7.6 2.50 ... 23 4 36190137 9047534 23880100 36190137 36190137 2020 2024 0
9 Joe Burrow 28022 QB CIN 20 79.8 619 662 8.2 2.62 ... 23 4 36190137 9047534 23880100 36190137 36190137 2020 2024 0
10 Joe Burrow 28022 QB CIN 10 73.5 381 404 8.9 2.60 ... 23 4 36190137 9047534 23880100 36190137 36190137 2020 2024 0

3 rows × 54 columns

In [1096]:
print("Justin Herbert estimated contract:", max(knn.predict(justin_herbert[selected_feature_names]))[0])
Justin Herbert estimated contract: 45401133.4
In [1098]:
print("Joe Burrow estimated contract:", max(knn.predict(joe_burrow[selected_feature_names]))[0])
Joe Burrow estimated contract: 41358996.2

Based on real life intuition, these predictions are a bit on the lower side, possibly due to not being able to account for inflation without limiting the dataset too much. After exploring and analyzing the NFL player stats data, we observed several interesting findings. We found that certain features like yards, completions and dropbacks were highly correlated and not needed in a predictive model. Additionally, we were able to use machine learning techniques such as neural networks and K-Nearest Neighbors to predict player performance based on their data. This project goes through the data science lifecycle. We started with data collection and processing, moving on to exploration to better understand the data and identify any issues or missing values. We went through feature selection, selecting and transforming relevant features to improve model performance. Next, we used various machine learning algorithms to build models and make predictions on player performance. Overall, this project demonstrates the importance of using exploratory data analysis and machine learning techniques to gain insights from complex data. It highlights the iterative nature of the data science lifecycle, where each step informs and improves the subsequent steps, ultimately leading to a better understanding of the data and better models. We also proved a hypothesis that based on certain data we can predict players contracts on a certain interval of confidence.